
PREDIPATH predictions using the predipath-DB presence/absence data

Table of contents

  1. Preparation and description of the data

  2. Prediction for the P and NP classes

  3. Concluding remarks

1. Preparation and description of the data

The data were provided as a table of 65 lines by 65 columns. We kept only the 58 genomes (not 59!) that were annotated NPA-NP, PA-NP or PA-P (NPA: non plant associated, PA: plant associated, NP: non pathogenic, P: pathogenic). The columns were originally named cluster_1, cluster_2... We shortened and normalized these names to CL01, CL02... Here is a short description of the columns, where SUM appears to be the number of genomes (out of 58) in which the cluster is present:


     Name          SUM
     CL01            1
     CL02            1
     CL03            4
     CL04            1
     CL05           58
     CL06            1
     CL07            1
     CL08            1
     CL09            1
     CL10            1
     CL11           57
     CL12            1
     CL13           10
     CL14            5
     CL15           22
     CL16            6
     CL17            2
     CL18           44
     CL19            1
     CL20            1
     CL21           55
     CL22            1
     CL23            1
     CL24           14
     CL25           30
     CL26           32
     CL27           30
     CL28           33
     CL29           33
     CL30           29
     CL31           33
     CL32           33
     CL33           37
     CL34           39
     CL35           40
     CL36           39
     CL37           34
     CL38           33
     CL39           37
     CL40           32
     CL41           42
     CL42           37
     CL43           38
     CL44           37
     CL45           41
     CL46           37
     CL47           37
     CL48           36
     CL49           39
     CL50           40
     CL51            2
     CL52            2
     CL53            1
     CL54            2
     CL55            1
     CL56            1
     CL57            1
     CL58            1
     CL59            1
     CL60            1
     CL61            1
     CL62            1
     CL63            1
     CL64            1
     CL65           39
     

A small R script removed the single constant column, CL05, which is present in all 58 genomes.
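The constant-column check can be sketched as follows. This is a minimal Python illustration, not the R script mentioned above; the toy table and the helper name `constant_columns` are hypothetical and only serve to show the idea.

```python
# Toy presence/absence table: one list of 0/1 values per column.
# Hypothetical data, not the predipath-DB matrix.
toy = {
    "CL_a": [1, 1, 1, 1],   # present in every genome -> constant
    "CL_b": [0, 1, 0, 0],
    "CL_c": [0, 0, 0, 0],   # absent from every genome -> constant
}


def constant_columns(table):
    """Return the names of the columns that take a single value."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


to_drop = constant_columns(toy)
kept = {name: values for name, values in toy.items() if name not in to_drop}
print("dropped:", to_drop)
print("kept:", list(kept))
```

A constant column carries no information for discriminating between classes, which is why it can be dropped before any modelling.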

Another small R script removed 35 equivalent or redundant columns. Each line below shows a retained column, followed by the columns equal to it:


        Column  Equal
     1  CL01    CL02 CL04 CL06 CL07 CL08 CL09 CL10 CL12 CL23 CL53 CL55
     3  CL03
     10 CL11
     12 CL13
     13 CL14
     14 CL15
     15 CL16
     16 CL17    CL52
     17 CL18
     18 CL19    CL20
     20 CL21
     21 CL22
     23 CL24
     24 CL25
     25 CL26    CL40
     26 CL27
     27 CL28    CL29 CL31 CL32 CL38
     29 CL30
     32 CL33    CL39 CL42 CL44 CL46 CL47
     33 CL34    CL36 CL49 CL65
     34 CL35    CL50
     36 CL37
     40 CL41
     42 CL43
     44 CL45
     47 CL48
     50 CL51
     53 CL54
     55 CL56    CL57 CL58 CL59 CL60 CL61 CL62 CL63 CL64
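
Grouping identical columns, as in the table above, can be sketched as follows. This is a hedged Python illustration rather than the R script that was actually used; the toy table and the function name `group_equal_columns` are hypothetical.

```python
# Toy presence/absence table (hypothetical data).
toy = {
    "CL_a": [1, 0, 0, 1],
    "CL_b": [1, 0, 0, 1],   # identical to CL_a
    "CL_c": [0, 1, 0, 0],
    "CL_d": [1, 0, 0, 1],   # identical to CL_a
}


def group_equal_columns(table):
    """Map each first-seen column to the later columns identical to it."""
    representative = {}   # pattern -> name of the first column showing it
    groups = {}           # representative name -> list of duplicate names
    for name, values in table.items():
        pattern = tuple(values)
        if pattern in representative:
            groups[representative[pattern]].append(name)
        else:
            representative[pattern] = name
            groups[name] = []
    return groups


groups = group_equal_columns(toy)
redundant = [dup for dups in groups.values() for dup in dups]
print(groups)
print("redundant:", redundant)
```

Only the first column of each group needs to be kept, since the duplicates would enter any model with exactly the same values.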
     

Of the remaining 29 columns, we removed 8 with near-zero variance:


          names2rm
     [1,] "CL01"
     [2,] "CL11"
     [3,] "CL17"
     [4,] "CL19"
     [5,] "CL22"
     [6,] "CL51"
     [7,] "CL54"
     [8,] "CL56"
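
The source does not say which near-zero-variance criterion was applied. As a sketch only, the Python code below assumes a frequency-ratio rule with a threshold of 95/5 (the default `freqCut` of `nearZeroVar` in the R package caret) and applies it to the SUM values listed above for the 29 remaining columns. The names `sums`, `freq_ratio` and `FREQ_CUT` are introduced here for illustration.

```python
N_GENOMES = 58
FREQ_CUT = 95 / 5   # assumed threshold (19); not stated in the source

# SUM values of the 29 remaining columns, copied from the first table.
sums = {
    "CL01": 1, "CL03": 4, "CL11": 57, "CL13": 10, "CL14": 5, "CL15": 22,
    "CL16": 6, "CL17": 2, "CL18": 44, "CL19": 1, "CL21": 55, "CL22": 1,
    "CL24": 14, "CL25": 30, "CL26": 32, "CL27": 30, "CL28": 33, "CL30": 29,
    "CL33": 37, "CL34": 39, "CL35": 40, "CL37": 34, "CL41": 42, "CL43": 38,
    "CL45": 41, "CL48": 36, "CL51": 2, "CL54": 2, "CL56": 1,
}


def freq_ratio(present, n):
    """Count of the most common value divided by the count of the other."""
    absent = n - present
    major, minor = max(present, absent), min(present, absent)
    return float("inf") if minor == 0 else major / minor


names2rm = [name for name, s in sums.items()
            if freq_ratio(s, N_GENOMES) > FREQ_CUT]
print(names2rm)
```

Under this assumption the flagged columns are the same 8 as in the list above. CL21 (present in 55 of 58 genomes) has a ratio of 55/3, just under the threshold, which is consistent with it being kept.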
     

For the rest of the analysis, we therefore had only 21 columns and 58 genomes in three classes:


     
     Description of the 3 classes for the 58 genomes
     ===============================================
     
            Num    Count  Percentage
     NPA-NP   0        7        12 %
     PA-NP    1        9        16 %
     PA-P     2       42        72 %
     
     Counts per class for the 21 presence/absence data columns in predipath-db
     =========================================================================
     
          NPA-NP PA-NP PA-P total
     CL03      2     1    1     4
     CL13      4     5    1    10
     CL14      3     2    0     5
     CL15      7     6    9    22
     CL16      4     2    0     6
     CL18      3     2   39    44
     CL21      6     9   40    55
     CL24      0     0   14    14
     CL25      0     0   30    30
     CL26      0     0   32    32
     CL27      0     0   30    30
     CL28      0     0   33    33
     CL30      0     0   29    29
     CL33      0     0   37    37
     CL34      0     1   38    39
     CL35      0     1   39    40
     CL37      0     0   34    34
     CL41      0     1   41    42
     CL43      0     0   38    38
     CL45      0     1   40    41
     CL48      0     0   36    36
     
     Percentages per class
     =====================
     
            NPA-NP    PA-NP     PA-P
     CL03     29 %     11 %      2 %
     CL13     57 %     56 %      2 %
     CL14     43 %     22 %      0 %
     CL15    100 %     67 %     21 %
     CL16     57 %     22 %      0 %
     CL18     43 %     22 %     93 %
     CL21     86 %    100 %     95 %
     CL24      0 %      0 %     33 %
     CL25      0 %      0 %     71 %
     CL26      0 %      0 %     76 %
     CL27      0 %      0 %     71 %
     CL28      0 %      0 %     79 %
     CL30      0 %      0 %     69 %
     CL33      0 %      0 %     88 %
     CL34      0 %     11 %     90 %
     CL35      0 %     11 %     93 %
     CL37      0 %      0 %     81 %
     CL41      0 %     11 %     98 %
     CL43      0 %      0 %     90 %
     CL45      0 %     11 %     95 %
     CL48      0 %      0 %     86 %
     

Some lines (genomes) have exactly the same profile over the 21 columns. When such genomes belong to distinct classes, this is a contradiction: the data cannot discriminate between those classes. Removing the 33 duplicated lines leaves only 25 distinct profiles over the 21 columns of data.


      Equal lines
      ===========
     
        Line         Equals
     1  000773975.1  6 14
     2  001267535.1  3 10
     4  001484765.1
     5  002752575.1  38
     7  900068895.1
     8  000026185.1
     9  000196615.1  16
     11 000745075.1  12
     13 001269445.1
     15 001517405.1
     17 000026985.1  19 58
     18 000027205.1  20 25 26 28 31 32 33 35 43 56
     21 000165815.1
     22 000240705.2  39 41 42 44 45 50 51 53 54 55
     23 000367545.1
     24 000367565.1
     27 000367625.1
     29 000367665.1
     30 000404125.1  36
     34 000590885.1
     37 001050515.1
     40 002732175.1  46 47 49
     48 002732335.1
     52 002732425.1
     57 002803865.1
     
     Classes and contradictions between classes
     ==========================================
     
           Line        Class Equal.Lines Classes
      [1,] 000773975.1 0     3           0 0 1
      [2,] 001267535.1 0     3           0 0 1
      [3,] 001484765.1 0     1           0
      [4,] 002752575.1 0     2           0 2
      [5,] 900068895.1 0     1           0
      [6,] 000026185.1 1     1           1
      [7,] 000196615.1 1     2           1 1
      [8,] 000745075.1 1     2           1 1
      [9,] 001269445.1 1     1           1
     [10,] 001517405.1 1     1           1
     [11,] 000026985.1 2     3           2 2 2
     [12,] 000027205.1 2     11          2 2 2 2 2 2 2 2 2 2 2
     [13,] 000165815.1 2     1           2
     [14,] 000240705.2 2     11          2 2 2 2 2 2 2 2 2 2 2
     [15,] 000367545.1 2     1           2
     [16,] 000367565.1 2     1           2
     [17,] 000367625.1 2     1           2
     [18,] 000367665.1 2     1           2
     [19,] 000404125.1 2     2           2 2
     [20,] 000590885.1 2     1           2
     [21,] 001050515.1 2     1           2
     [22,] 002732175.1 2     4           2 2 2 2
     [23,] 002732335.1 2     1           2
     [24,] 002732425.1 2     1           2
     [25,] 002803865.1 2     1           2
     

Here are the genomes involved in the three contradictions, i.e. identical profiles with different classes:


     Contradiction 1
                                          Classe CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
     000773975.1-Erwinia_typographi|M043b      0    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     002865965.1-Erwinia_sp.|B116              0    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     001422605.1-Erwinia_sp.|Leaf53            1    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     
     Contradiction 2
                                             Classe CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
     001267535.1-Erwinia_iniecta|B120             0    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     001267545.1-Erwinia_iniecta|B149             0    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     000336255.1-Erwinia_toletana|DAPP-PG735      1    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     
     Contradiction 3
                                              Classe CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
     002752575.1-Erwinia_sp.|OLMDLW33              0    1    1    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     001571305.1-Erwinia_persicina|NBRC102418      2    1    1    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
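
The search for contradictions can be sketched as follows: group the genomes by profile and keep the groups that span more than one class. This Python sketch uses the eight genomes of the three tables above. The function name `contradictions` is hypothetical, and only the first seven columns are typed in because the other 14 are 0 for all eight genomes.

```python
from collections import defaultdict

# (genome, class, values for CL03 CL13 CL14 CL15 CL16 CL18 CL21).
# The remaining 14 columns (CL24 ... CL48) are 0 for these eight genomes.
rows = [
    ("000773975.1", 0, (0, 0, 0, 1, 0, 1, 1)),
    ("002865965.1", 0, (0, 0, 0, 1, 0, 1, 1)),
    ("001422605.1", 1, (0, 0, 0, 1, 0, 1, 1)),
    ("001267535.1", 0, (0, 1, 1, 1, 1, 0, 1)),
    ("001267545.1", 0, (0, 1, 1, 1, 1, 0, 1)),
    ("000336255.1", 1, (0, 1, 1, 1, 1, 0, 1)),
    ("002752575.1", 0, (1, 1, 0, 1, 0, 1, 1)),
    ("001571305.1", 2, (1, 1, 0, 1, 0, 1, 1)),
]


def contradictions(rows):
    """Group genomes by profile; keep the groups spanning several classes."""
    by_profile = defaultdict(list)
    for genome, cls, head in rows:
        profile = head + (0,) * 14   # pad to the full 21 columns
        by_profile[profile].append((genome, cls))
    return [members for members in by_profile.values()
            if len({cls for _, cls in members}) > 1]


found = contradictions(rows)
for members in found:
    print(members)
```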
     

2. Prediction for the P and NP classes

With only 7 genomes in the NPA-NP class and 9 in the PA-NP class, it is difficult to obtain robust results. Moreover, the same profile can be found in both the NPA-NP class and the PA-NP class (contradictions 1 and 2 above). We therefore merged these two classes into a single NP class and tried to predict the two classes NP and P using the 58 genomes.


     Description of the 2 classes for the 58 genomes
     ===============================================
     
        Num    Count  Percentage
     NP   1       16        28 %
     P    2       42        72 %
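
The merge of the two non-pathogenic classes can be sketched as follows. This is a minimal Python illustration using the class sizes of the three-class table in section 1; the dictionary names are hypothetical.

```python
from collections import Counter

# Class sizes from the three-class table in section 1.
three_class = {"NPA-NP": 7, "PA-NP": 9, "PA-P": 42}

# Both non-pathogenic classes are relabelled NP; PA-P becomes P.
merge = {"NPA-NP": "NP", "PA-NP": "NP", "PA-P": "P"}

two_class = Counter()
for label, count in three_class.items():
    two_class[merge[label]] += count

total = sum(two_class.values())
percent = {label: round(100 * count / total)
           for label, count in two_class.items()}
print(dict(two_class), percent)
```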
     

Below are the counts and percentages for these two classes, which we also tried to sort:

And the data can be found here (class 1 is NP, class 2 is P):

No simple rule solves our problem, so we used logistic regression to try to predict the class.

A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:


          column     AUROC
            CL41    0.9568
            CL43    0.9524
            CL45    0.9449
            CL33    0.9405
            CL35    0.9330
            CL48    0.9286
            CL34    0.9211
            CL37    0.9048
            CL28    0.8929
            CL26    0.8810
            CL25    0.8571
            CL27    0.8571
            CL30    0.8452
            CL18    0.8080
            CL15    0.7991
            CL13    0.7693
            CL16    0.6875
            CL24    0.6667
            CL14    0.6562
            CL03    0.5818
            CL21    0.5074
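
With a single 0/1 predictor, a logistic regression can only produce two distinct fitted probabilities, so the ROC curve has one interior point and the AUROC reduces to 0.5 + |TPR - FPR| / 2. The Python sketch below applies this to five of the columns, using the per-class counts of section 1 with the NPA-NP and PA-NP counts added together. The function name `auroc_binary` is introduced here for illustration, and the absolute value assumes that the fitted model follows the direction of the association.

```python
N_NP, N_P = 16, 42   # class sizes after merging NPA-NP and PA-NP

# Genomes carrying the cluster: (count in NP, count in P).
present = {
    "CL41": (1, 41),
    "CL43": (0, 38),
    "CL15": (13, 9),
    "CL03": (3, 1),
    "CL21": (15, 40),
}


def auroc_binary(n_np, n_p):
    """AUROC of a single 0/1 predictor, whichever way it points."""
    tpr = n_p / N_P     # fraction of P genomes carrying the cluster
    fpr = n_np / N_NP   # fraction of NP genomes carrying the cluster
    return 0.5 + abs(tpr - fpr) / 2


aucs = {name: auroc_binary(n_np, n_p) for name, (n_np, n_p) in present.items()}
for name, value in aucs.items():
    print(name, round(value, 4))
```

For these five columns the values agree with the table above (0.9568, 0.9524, 0.7991, 0.5818 and 0.5074), which also shows that a cluster found mostly in NP genomes, such as CL15, can be as informative as one found mostly in P genomes.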
     

A multiple binary logistic regression can predict the class using only 5 variables:


     classe ~ CL16 + CL21 + CL34 + CL35 + CL43
     
     Deviance Residuals:
          Min        1Q    Median        3Q       Max
     -0.45904   0.00000   0.00002   0.00002   2.14597
     
     Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
     (Intercept)    20.86   14530.08   0.001    0.999
     CL16          -41.15   21211.30  -0.002    0.998
     CL21          -23.06   14530.09  -0.002    0.999
     CL34          -45.13   68159.64  -0.001    0.999
     CL35           24.76   48196.14   0.001    1.000
     CL43           45.13   48826.18   0.001    0.999
     
     Null deviance:     68.3243  on 57  degrees of freedom
     Residual deviance:  6.5017  on 52  degrees of freedom
     AIC:               18.502
     
     Number of Fisher Scoring iterations: 21
     
     Auroc :     0.9933
     

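With this multiple binary logistic regression, only one genome is misclassified: 001571305.1 Erwinia_persicina|NBRC102418. This is expected, since its profile is identical to that of the NP genome 002752575.1 (contradiction 3 above). The very large standard errors and the 21 Fisher scoring iterations suggest quasi-complete separation, so the individual coefficients and p-values should be read with caution.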
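
The misclassification can be checked by hand from the printed coefficients. The Python sketch below assumes that the modelled outcome is class P and uses the rounded estimates from the output above, so the result is approximate. The function name `prob_pathogenic` is hypothetical.

```python
import math

# Coefficients as printed in the regression output (rounded values).
coef = {"(Intercept)": 20.86, "CL16": -41.15, "CL21": -23.06,
        "CL34": -45.13, "CL35": 24.76, "CL43": 45.13}


def prob_pathogenic(profile):
    """Fitted probability of class P for a dict of 0/1 cluster values."""
    eta = coef["(Intercept)"] + sum(
        coef[name] * profile.get(name, 0)
        for name in coef if name != "(Intercept)")
    return 1 / (1 + math.exp(-eta))


# Profile shared by 002752575.1 (NP) and 001571305.1 (P) in contradiction 3:
# of the five model variables, only CL21 is present.
shared = {"CL16": 0, "CL21": 1, "CL34": 0, "CL35": 0, "CL43": 0}
p = prob_pathogenic(shared)
print(round(p, 3))
```

The fitted probability of P for this shared profile is about 0.1, well below 0.5, so both genomes are predicted NP and the pathogenic one is necessarily misclassified.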

Here is a graphical representation of the 5 variables:

       [figure not available]

But it may be easier to read with three distinct graphics:

       [figure not available]

       [figure not available]

       [figure not available]

3. Concluding remarks

Although a statistical solution exists, it may also be possible to define a simple presence/absence profile over the 5 variables to predict pathogenicity.

 

 
