PREDIPATH predictions using the secondary metabolites data

Table of contents

  1. Preparation and description of the data

  2. Prediction for the P and NP classes

  3. Concluding remarks

1. Preparation and description of the data

The data were given as a table of 67 rows by 30 columns. Column number 13 was removed because it was named cf_putative. We kept only the 59 genomes that were annotated NPA-NP, PA-NP or PA-P (NPA: non plant associated, PA: plant associated, NP: non pathogenic, P: pathogenic). For each remaining column, the table below gives the number of genomes with a positive value and the sum of the values:


     NAME OF THE COLUMN          number of positive values          sum
     acyl_amino_acids                                    2            2
     amglyccycl                                          1            1
     arylpolyene                                         5            5
     arylpolyene.cf_saccharide                           2            2
     bacteriocin                                         2            2
     butyrolactone                                       5            5
     butyrolactone.cf_saccharide                         1            1
     cf_fatty_acid                                      59          132
     cf_fatty_acid.acyl_amino_acids.ladderane            2            2
     cf_fatty_acid.arylpolyene                           2            2
     cf_fatty_acid.butyrolactone                         1            1
     cf_fatty_acid.ladderane.siderophore                 1            1
     cf_saccharide                                      56          134
     cyanobactin                                         0            0
     hserlactone                                        21           26
     hserlactone.arylpolyene.cf_saccharide               8            8
     hserlactone.cf_fatty_acid                           1            1
     lantipeptide                                        1            1
     nrps                                               55           83
     other                                               5            6
     phosphonate                                         1            1
     phosphonate.cf_saccharide.nrps                      1            1
     phosphonate.nrps                                    1            1
     siderophore                                        52           55
     t1pks                                               3            3
     t1pks.nrps                                         39           40
     terpene                                             5            5
     thiopeptide                                        22           22
     transatpks.nrps                                     1            1
     

We then decided to shorten the names of some columns. Here is the new description of the columns:


     Number                NAME    POSITIVE
     01                 acyl_AA           2
     02              amglyccycl           1
     03             arylpolyene           5
     04         arylpolyene.CFS           2
     05             bacteriocin           2
     06           butyrolactone           5
     07       butyrolactone.CFS           1
     08                    CFFA          59
     09  CFFA.acyl_AA.ladderane           2
     10       CFFA.arylpolyene            2
     11     CFFA.butyrolactone            1
     12              CFFA.LASI            1
     13                    CFS           56
     14            cyanobactin            0
     15                   HSER           21
     16   HSER.arylpolyene.CFS            8
     17              HSER.CFFA            1
     18           lantipeptide            1
     19                   nrps           55
     20                  other            5
     21                  PHOSP            1
     22         PHOSP.CFS.nrps            1
     23             PHOSP.nrps            1
     24            siderophore           52
     25                  t1pks            3
     26             t1pks.nrps           39
     27                terpene            5
     28            thiopeptide           22
     29        transatpks.nrps            1
     

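A small R script identified 2 columns that are constant once the counts are reduced to presence/absence (CFFA, positive in all 59 genomes, and cyanobactin, positive in none), and we removed them: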


                        Column Const distinctVals keep/force
     1                 acyl_AA    No            2        YES
     2              amglyccycl    No            2        YES
     3             arylpolyene    No            2        YES
     4         arylpolyene.CFS    No            2        YES
     5             bacteriocin    No            2        YES
     6           butyrolactone    No            2        YES
     7       butyrolactone.CFS    No            2        YES
     8                    CFFA   Yes            1
     9  CFFA.acyl_AA.ladderane    No            2        YES
     10       CFFA.arylpolyene    No            2        YES
     11     CFFA.butyrolactone    No            2        YES
     12              CFFA.LASI    No            2        YES
     13                    CFS    No            2        YES
     14            cyanobactin   Yes            1
     15                   HSER    No            2        YES
     16   HSER.arylpolyene.CFS    No            2        YES
     17              HSER.CFFA    No            2        YES
     18           lantipeptide    No            2        YES
     19                   nrps    No            2        YES
     20                  other    No            2        YES
     21                  PHOSP    No            2        YES
     22         PHOSP.CFS.nrps    No            2        YES
     23             PHOSP.nrps    No            2        YES
     24            siderophore    No            2        YES
     25                  t1pks    No            2        YES
     26             t1pks.nrps    No            2        YES
     27                terpene    No            2        YES
     28            thiopeptide    No            2        YES
     29        transatpks.nrps    No            2        YES
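
The constant-column check can be sketched as follows. This is a minimal Python illustration, not the R script used for the analysis; the function name and the toy counts are hypothetical, and it assumes that counts are reduced to presence/absence before the test, as the CFFA row (59 positive genomes, 1 distinct value) suggests.

```python
def constant_columns(table):
    """Return the names of columns whose presence/absence value never changes."""
    constant = []
    for name, counts in table.items():
        # Binarize: a genome is positive when it has at least one cluster.
        presence = {1 if c > 0 else 0 for c in counts}
        if len(presence) == 1:
            constant.append(name)
    return constant


# Hypothetical counts for four genomes (not the real data).
toy = {
    "CFFA": [2, 1, 3, 2],         # always present: constant once binarized
    "cyanobactin": [0, 0, 0, 0],  # never present: constant
    "HSER": [1, 0, 2, 0],         # varies: kept
}
print(constant_columns(toy))
```

On the toy table, only CFFA and cyanobactin are reported, which mirrors the two columns flagged above.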
     

Another small R script removed 2 more columns that duplicate another column (CFFA.butyrolactone is equal to butyrolactone.CFS, and HSER.CFFA is equal to CFFA.LASI):


        Column                  Equal               Opposite
     1  acyl_AA
     2  amglyccycl
     3  arylpolyene
     4  arylpolyene.CFS
     5  bacteriocin
     6  butyrolactone
     7  butyrolactone.CFS       CFFA.butyrolactone
     8  CFFA.acyl_AA.ladderane
     9  CFFA.arylpolyene
     11 CFFA.LASI               HSER.CFFA
     12 CFS
     13 HSER
     14 HSER.arylpolyene.CFS
     16 lantipeptide
     17 nrps
     18 other
     19 PHOSP
     20 PHOSP.CFS.nrps
     21 PHOSP.nrps
     22 siderophore
     23 t1pks
     24 t1pks.nrps
     25 terpene
     26 thiopeptide
     27 transatpks.nrps
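
A possible way to detect equal or opposite columns is shown below. It is a Python sketch under the assumption that all columns are already coded 0/1; the function name, the toy values and the column `not_HSER` are hypothetical and only serve the illustration.

```python
from itertools import combinations


def equal_or_opposite(table):
    """List pairs of 0/1 columns that are identical or exact complements."""
    pairs = []
    for a, b in combinations(table, 2):
        if table[a] == table[b]:
            pairs.append((a, b, "equal"))
        elif all(x + y == 1 for x, y in zip(table[a], table[b])):
            pairs.append((a, b, "opposite"))
    return pairs


# Hypothetical presence/absence values for four genomes.
toy = {
    "butyrolactone.CFS": [0, 1, 0, 0],
    "CFFA.butyrolactone": [0, 1, 0, 0],  # same pattern as the column above
    "HSER": [1, 0, 1, 0],
    "not_HSER": [0, 1, 0, 1],            # made-up complement of HSER
}
for pair in equal_or_opposite(toy):
    print(pair)
```

Keeping only the first column of each reported pair removes the redundancy without losing information.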
     

For the remaining 25 columns, we removed 13 columns with a near zero variance:


     VARIABLE               freqRatio percentUnique zeroVar   nzv
     acyl_AA                28.500000      3.389831   FALSE  TRUE
     amglyccycl             58.000000      3.389831   FALSE  TRUE
     arylpolyene            10.800000      3.389831   FALSE FALSE
     arylpolyene.CFS        28.500000      3.389831   FALSE  TRUE
     bacteriocin            28.500000      3.389831   FALSE  TRUE
     butyrolactone          10.800000      3.389831   FALSE FALSE
     butyrolactone.CFS      58.000000      3.389831   FALSE  TRUE
     CFFA.acyl_AA.ladderane 28.500000      3.389831   FALSE  TRUE
     CFFA.arylpolyene       28.500000      3.389831   FALSE  TRUE
     CFFA.LASI              58.000000      3.389831   FALSE  TRUE
     CFS                    18.666667      3.389831   FALSE FALSE
     HSER                    1.809524      3.389831   FALSE FALSE
     HSER.arylpolyene.CFS    6.375000      3.389831   FALSE FALSE
     lantipeptide           58.000000      3.389831   FALSE  TRUE
     nrps                   13.750000      3.389831   FALSE FALSE
     other                  10.800000      3.389831   FALSE FALSE
     PHOSP                  58.000000      3.389831   FALSE  TRUE
     PHOSP.CFS.nrps         58.000000      3.389831   FALSE  TRUE
     PHOSP.nrps             58.000000      3.389831   FALSE  TRUE
     siderophore             7.428571      3.389831   FALSE FALSE
     t1pks                  18.666667      3.389831   FALSE FALSE
     t1pks.nrps              1.950000      3.389831   FALSE FALSE
     terpene                10.800000      3.389831   FALSE FALSE
     thiopeptide             1.681818      3.389831   FALSE FALSE
     transatpks.nrps        58.000000      3.389831   FALSE  TRUE
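
The two statistics in this table can be recomputed by hand. The column names resemble the output of the `nearZeroVar` function of the R package caret, so the Python sketch below assumes its default cut-offs (a frequency ratio above 95/5 and at most 10 % of unique values); the source does not state which cut-offs were used, and the function name is hypothetical.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10):
    """Return (freqRatio, percentUnique, nzv) for one column."""
    counts = Counter(values).most_common()
    if len(counts) < 2:
        raise ValueError("constant column: remove it before this step")
    # Ratio of the most frequent value to the second most frequent one.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of the number of genomes.
    percent_unique = 100 * len(counts) / len(values)
    nzv = freq_ratio > freq_cut and percent_unique <= unique_cut
    return freq_ratio, percent_unique, nzv


acyl_aa = [1] * 2 + [0] * 57  # 2 positive genomes out of 59
cfs = [1] * 56 + [0] * 3      # 56 positive genomes out of 59
print(near_zero_variance(acyl_aa))
print(near_zero_variance(cfs))
```

With 2 positive genomes out of 59 the ratio is 57/2 = 28.5, above the assumed cut-off of 19, whereas CFS gives 56/3 = 18.67 and stays just below it, in agreement with the table.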
     

So for the remaining part of the analysis, we had only 12 columns and 59 genomes in three classes:


     
     Description of the 3 classes for the 59 genomes
     ===============================================
     
            Num    Count  Percentage
     NPA-NP   0        7        12 %
     PA-NP    1        9        15 %
     PA-P     2       43        73 %
     
     Counts per class for the 12 secondary metabolites
     =================================================
     
                          NPA-NP PA-NP PA-P total
     arylpolyene               2     1    2     5
     butyrolactone             2     2    1     5
     CFS                       7     9   40    56
     HSER                      6     9    6    21
     HSER.arylpolyene.CFS      1     6    1     8
     nrps                      6     8   41    55
     other                     2     0    3     5
     siderophore               5     8   39    52
     t1pks                     0     0    3     3
     t1pks.nrps                4     1   34    39
     terpene                   1     4    0     5
     thiopeptide               7     9    6    22
     
     Percentages per class for the 12 secondary metabolites
     ======================================================
     
                            NPA-NP    PA-NP     PA-P
     arylpolyene              29 %     11 %      5 %
     butyrolactone            29 %     22 %      2 %
     CFS                     100 %    100 %     93 %
     HSER                     86 %    100 %     14 %
     HSER.arylpolyene.CFS     14 %     67 %      2 %
     nrps                     86 %     89 %     95 %
     other                    29 %      0 %      7 %
     siderophore              71 %     89 %     91 %
     t1pks                     0 %      0 %      7 %
     t1pks.nrps               57 %     11 %     79 %
     terpene                  14 %     44 %      0 %
     thiopeptide             100 %    100 %     14 %
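
The counts and percentages per class follow from a simple tally. The Python sketch below rebuilds the HSER row from the totals given above; the order of the genomes inside each class is arbitrary and the function name is hypothetical.

```python
def class_summary(classes, presence):
    """Count positive genomes per class and give the rounded percentage."""
    sizes, positives = {}, {}
    for label, value in zip(classes, presence):
        sizes[label] = sizes.get(label, 0) + 1
        positives[label] = positives.get(label, 0) + value
    return {
        label: (positives[label], round(100 * positives[label] / sizes[label]))
        for label in sizes
    }


# Class sizes and HSER counts taken from the tables above.
classes = ["NPA-NP"] * 7 + ["PA-NP"] * 9 + ["PA-P"] * 43
hser = [1] * 6 + [0] * 1 + [1] * 9 + [1] * 6 + [0] * 37
print(class_summary(classes, hser))
```

The result gives 6, 9 and 6 positive genomes, that is 86 %, 100 % and 14 % of the three classes.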
     

Some genomes have exactly the same profile, and in one case two identical profiles belong to distinct classes, which makes it impossible to discriminate between those classes for that profile. After removing the 39 duplicated lines, only 20 distinct profiles remain.


     Equal lines
     ===========
     
        Line             Equals
     1  NPNP-000773975.1
     2  NPNP-001267535.1  3
     4  NPNP-001484765.1
     5  NPNP-002752575.1  8
     6  NPNP-002865965.1
     7  NPNP-900068895.1
     9  NPPA-000196615.1  13 16
     10 NPPA-000336255.1
     11 NPPA-000745075.1
     12 NPPA-000770305.1
     14 NPPA-001422605.1
     15 NPPA-001517405.1
     17 PATH-000026985.1  19
     18 PATH-000027205.1  20 22 23 24 25 26 27 28 29 31 32 33 35 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
     21 PATH-000165815.1  58
     30 PATH-000404125.1  36
     34 PATH-000590885.1
     37 PATH-001050515.1
     38 PATH-001571305.1
     59 PATH-003485445.1
     
     Classes and contradictions between classes
     ==========================================
     
           Line             Class Equal.Lines Classes
      [1,] NPNP-000773975.1 0     1           0
      [2,] NPNP-001267535.1 0     2           0 0
      [3,] NPNP-001484765.1 0     1           0
      [4,] NPNP-002752575.1 0     2           0 1
      [5,] NPNP-002865965.1 0     1           0
      [6,] NPNP-900068895.1 0     1           0
      [7,] NPPA-000196615.1 1     3           1 1 1
      [8,] NPPA-000336255.1 1     1           1
      [9,] NPPA-000745075.1 1     1           1
     [10,] NPPA-000770305.1 1     1           1
     [11,] NPPA-001422605.1 1     1           1
     [12,] NPPA-001517405.1 1     1           1
     [13,] PATH-000026985.1 2     2           2 2
     [14,] PATH-000027205.1 2     33          2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     [15,] PATH-000165815.1 2     2           2 2
     [16,] PATH-000404125.1 2     2           2 2
     [17,] PATH-000590885.1 2     1           2
     [18,] PATH-001050515.1 2     1           2
     [19,] PATH-001571305.1 2     1           2
     [20,] PATH-003485445.1 2     1           2
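
Grouping identical profiles and spotting contradictions can be sketched as follows. This is a Python illustration with a hypothetical function name; the first two rows use the profile and classes of the two Erwinia genomes reported in this page, and the third row is made up.

```python
def group_profiles(rows):
    """Group genomes by identical profile; flag profiles seen in several classes."""
    groups = {}
    for name, label, profile in rows:
        groups.setdefault(tuple(profile), []).append((name, label))
    contradictions = [
        members for members in groups.values()
        if len({label for _, label in members}) > 1
    ]
    return groups, contradictions


shared = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1]
rows = [
    ("GCF_002752575.1", 0, shared),  # class 0 (NPA-NP)
    ("GCF_000026185.1", 1, shared),  # class 1 (PA-NP), same profile
    ("hypothetical_genome", 2, [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]),
]
groups, contradictions = group_profiles(rows)
print(len(groups), "distinct profiles")
print(contradictions)
```

Here two distinct profiles remain and one of them is contradictory, because it is shared by a class 0 genome and a class 1 genome.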
     

Here are the two genomes that have the same profile but distinct classes:


                                                  Class arylpolyene butyrolactone CFS HSER HSER.arylpolyene.CFS nrps other siderophore t1pks t1pks.nrps terpene thiopeptide
     GCF_002752575.1-Erwinia_sp.|OLMDLW33             0           0             1   1    1                    0    1     0           1     0          0       0           1
     GCF_000026185.1-Erwinia_tasmaniensis|Et1_99      1           0             1   1    1                    0    1     0           1     0          0       0           1
     

2. Prediction for the P and NP classes

With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain robust results. Moreover, the same profile is found in both the NPA-NP class and the PA-NP class. We therefore merged these two classes into a single NP class and tried to predict the two classes NP and P using the 59 genomes.


     Description of the 2 classes for the 59 genomes
     ===============================================
     
             Num    Count  Percentage
          NP   1       16        27 %
          P    2       43        73 %
     

Below are the counts and percentages for these two classes that we also tried to order:

And the data can be found here (class 1 is NP, class 2 is P):

There is no obvious solution to our problem.

A graphical representation such as a count heatmap makes the data easier to read. Columns are ordered so that those with the most positive NP genomes and the fewest positive P genomes come first.

       [Figure: count heatmap for the NP and P classes]

Using the three classes:

       [Figure: count heatmap for the three classes]

It may be easier to see the classes with three different graphics:

       [Figures: three graphics, one per class]

A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:


     
     Binary Logistic Regression with only one column
     ===============================================
     
                    Column     AUROC       NP        P
               thiopeptide    0.9302    100 %     14 %
                      HSER    0.8990     94 %     14 %
                t1pks.nrps    0.7391     31 %     79 %
      HSER.arylpolyene.CFS    0.7071     44 %      2 %
                   terpene    0.6562     31 %      0 %
             butyrolactone    0.6134     25 %      2 %
               arylpolyene    0.5705     19 %      5 %
               siderophore    0.5472     81 %     91 %
                      nrps    0.5392     88 %     95 %
                       CFS    0.5349    100 %     93 %
                     t1pks    0.5349      0 %      7 %
                     other    0.5276     12 %      7 %
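
The AUROC values of this table can be checked by hand. A model with a single presence/absence predictor can only output two distinct scores, so its ROC curve has a single corner and the area reduces to 0.5 + |p_NP - p_P| / 2, where p_NP and p_P are the fractions of positive genomes in each class. The Python sketch below computes it from the counts given earlier; the function name is hypothetical.

```python
def auroc_single_binary(pos_np, n_np, pos_p, n_p):
    """AUROC of a classifier built on one presence/absence column."""
    p_np = pos_np / n_np  # fraction of NP genomes with the metabolite
    p_p = pos_p / n_p     # fraction of P genomes with the metabolite
    # Probabilities that a random (NP, P) pair is ranked one way or the other;
    # the remaining pairs are ties and count for one half.
    np_only = p_np * (1 - p_p)
    p_only = (1 - p_np) * p_p
    ties = 1 - np_only - p_only
    auc = np_only + 0.5 * ties
    # A fitted regression orients the score in the favourable direction.
    return max(auc, 1 - auc)


print(round(auroc_single_binary(16, 16, 6, 43), 4))  # thiopeptide
print(round(auroc_single_binary(15, 16, 6, 43), 4))  # HSER
print(round(auroc_single_binary(5, 16, 34, 43), 4))  # t1pks.nrps
```

These three calls give 0.9302, 0.899 and 0.7391, the values listed for thiopeptide, HSER and t1pks.nrps.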
     

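A multiple binary logistic regression is able to predict the class using only 9 variables. The very large standard errors and the p-values equal to 1 usually indicate a complete or quasi-complete separation of the two classes, so the individual coefficients should not be interpreted: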


     Coefficients          Estimate Std. Error z value Pr(>|z|)
     (Intercept)              28.04  161829.67       0        1
     arylpolyene             -49.14  166878.33       0        1
     butyrolactone           -97.19  214196.19       0        1
     HSER                     99.68  402750.77       0        1
     HSER.arylpolyene.CFS    -49.15  145106.92       0        1
     nrps                     47.95  140693.62       0        1
     siderophore             -49.42  167965.54       0        1
     t1pks                    98.05  246528.08       0        1
     terpene                 -47.97  141612.77       0        1
     thiopeptide            -102.55  398075.87       0        1
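
To show how such a model turns a profile into a prediction, the Python sketch below applies the rounded estimates of the table to the profile shared by the two Erwinia genomes listed in section 1. It assumes that the modelled outcome is class P (the negative thiopeptide coefficient suggests it, but the source does not say so); the function names are hypothetical.

```python
import math

# Rounded estimates copied from the table above.
COEF = {
    "(Intercept)": 28.04, "arylpolyene": -49.14, "butyrolactone": -97.19,
    "HSER": 99.68, "HSER.arylpolyene.CFS": -49.15, "nrps": 47.95,
    "siderophore": -49.42, "t1pks": 98.05, "terpene": -47.97,
    "thiopeptide": -102.55,
}


def linear_predictor(profile):
    """Intercept plus the coefficients of the metabolites that are present."""
    return COEF["(Intercept)"] + sum(
        COEF[name] * value for name, value in profile.items() if name in COEF
    )


def probability_p(profile):
    """Probability of class P, assuming P is the modelled outcome."""
    eta = linear_predictor(profile)
    # Numerically stable logistic function.
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    return math.exp(eta) / (1 + math.exp(eta))


# Profile of the two Erwinia genomes; columns absent from COEF are ignored.
erwinia = {
    "arylpolyene": 0, "butyrolactone": 1, "CFS": 1, "HSER": 1,
    "HSER.arylpolyene.CFS": 0, "nrps": 1, "other": 0, "siderophore": 1,
    "t1pks": 0, "t1pks.nrps": 0, "terpene": 0, "thiopeptide": 1,
}
print(round(linear_predictor(erwinia), 2))
print(probability_p(erwinia) < 0.5)
```

The linear predictor is 28.04 - 97.19 + 99.68 + 47.95 - 49.42 - 102.55 = -73.49, so the probability of P is practically zero and the profile is predicted NP, which agrees with the annotation of both genomes.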
     

A graphical representation of these 9 variables only is helpful:

       [Figure: graphical representation of the 9 variables for the NP and P classes]

Using the three classes:

       [Figure: graphical representation of the 9 variables for the three classes]

Here again, it may be easier to see the classes with three different graphics:

       [Figures: three graphics, one per class, for the 9 variables]

3. Concluding remarks

Though we do not have a lot of genomes in each class, it is possible to build a logistic regression model that is able to predict pathogenicity.

 

 
