PREDIPATH predictions using the secondary metabolites data
1. Preparation and description of the data
The data were provided as a table of 67 rows by 30 columns. Column 13 was removed because it was named cf_putative. We kept only the 59 genomes that were annotated NPA-NP, PA-NP or PA-P (NPA: non-plant-associated, PA: plant-associated, NP: non-pathogenic, P: pathogenic). Here is the description of the columns:
NAME OF THE COLUMN                        number of positive values  sum
acyl_amino_acids                          2                          2
amglyccycl                                1                          1
arylpolyene                               5                          5
arylpolyene.cf_saccharide                 2                          2
bacteriocin                               2                          2
butyrolactone                             5                          5
butyrolactone.cf_saccharide               1                          1
cf_fatty_acid                             59                         132
cf_fatty_acid.acyl_amino_acids.ladderane  2                          2
cf_fatty_acid.arylpolyene                 2                          2
cf_fatty_acid.butyrolactone               1                          1
cf_fatty_acid.ladderane.siderophore       1                          1
cf_saccharide                             56                         134
cyanobactin                               0                          0
hserlactone                               21                         26
hserlactone.arylpolyene.cf_saccharide     8                          8
hserlactone.cf_fatty_acid                 1                          1
lantipeptide                              1                          1
nrps                                      55                         83
other                                     5                          6
phosphonate                               1                          1
phosphonate.cf_saccharide.nrps            1                          1
phosphonate.nrps                          1                          1
siderophore                               52                         55
t1pks                                     3                          3
t1pks.nrps                                39                         40
terpene                                   5                          5
thiopeptide                               22                         22
transatpks.nrps                           1                          1

We then decided to shorten the names of some columns. Here is the new description of the columns:
Number  NAME                      SUM
01      acyl_AA                   2
02      amglyccycl                1
03      arylpolyene               5
04      arylpolyene.CFS           2
05      bacteriocin               2
06      butyrolactone             5
07      butyrolactone.CFS         1
08      CFFA                      59
09      CFFA.acyl_AA.ladderane    2
10      CFFA.arylpolyene          2
11      CFFA.butyrolactone        1
12      CFFA.LASI                 1
13      CFS                       56
14      cyanobactin               0
15      HSER                      21
16      HSER.arylpolyene.CFS      8
17      HSER.CFFA                 1
18      lantipeptide              1
19      nrps                      55
20      other                     5
21      PHOSP                     1
22      PHOSP.CFS.nrps            1
23      PHOSP.nrps                1
24      siderophore               52
25      t1pks                     3
26      t1pks.nrps                39
27      terpene                   5
28      thiopeptide               22
29      transatpks.nrps           1

With a small R script, it was easy to get rid of 2 constant columns (CFFA, cyanobactin):
    Column                  Const  distinctVals  keep/force
 1  acyl_AA                 No     2             YES
 2  amglyccycl              No     2             YES
 3  arylpolyene             No     2             YES
 4  arylpolyene.CFS         No     2             YES
 5  bacteriocin             No     2             YES
 6  butyrolactone           No     2             YES
 7  butyrolactone.CFS       No     2             YES
 8  CFFA                    Yes    1
 9  CFFA.acyl_AA.ladderane  No     2             YES
10  CFFA.arylpolyene        No     2             YES
11  CFFA.butyrolactone      No     2             YES
12  CFFA.LASI               No     2             YES
13  CFS                     No     2             YES
14  cyanobactin             Yes    1
15  HSER                    No     2             YES
16  HSER.arylpolyene.CFS    No     2             YES
17  HSER.CFFA               No     2             YES
18  lantipeptide            No     2             YES
19  nrps                    No     2             YES
20  other                   No     2             YES
21  PHOSP                   No     2             YES
22  PHOSP.CFS.nrps          No     2             YES
23  PHOSP.nrps              No     2             YES
24  siderophore             No     2             YES
25  t1pks                   No     2             YES
26  t1pks.nrps              No     2             YES
27  terpene                 No     2             YES
28  thiopeptide             No     2             YES
29  transatpks.nrps         No     2             YES

With another small R script, it was easy to get rid of 2 more equivalent or redundant columns (CFFA.butyrolactone, HSER.CFFA):
    Column                  Equal               Opposite
 1  acyl_AA
 2  amglyccycl
 3  arylpolyene
 4  arylpolyene.CFS
 5  bacteriocin
 6  butyrolactone
 7  butyrolactone.CFS       CFFA.butyrolactone
 8  CFFA.acyl_AA.ladderane
 9  CFFA.arylpolyene
11  CFFA.LASI               HSER.CFFA
12  CFS
13  HSER
14  HSER.arylpolyene.CFS
16  lantipeptide
17  nrps
18  other
19  PHOSP
20  PHOSP.CFS.nrps
21  PHOSP.nrps
22  siderophore
23  t1pks
24  t1pks.nrps
25  terpene
26  thiopeptide
27  transatpks.nrps

For the remaining 25 columns, we removed 13 columns with a near zero variance:
VARIABLE                freqRatio  percentUnique  zeroVar  nzv
acyl_AA                 28.500000  3.389831       FALSE    TRUE
amglyccycl              58.000000  3.389831       FALSE    TRUE
arylpolyene             10.800000  3.389831       FALSE    FALSE
arylpolyene.CFS         28.500000  3.389831       FALSE    TRUE
bacteriocin             28.500000  3.389831       FALSE    TRUE
butyrolactone           10.800000  3.389831       FALSE    FALSE
butyrolactone.CFS       58.000000  3.389831       FALSE    TRUE
CFFA.acyl_AA.ladderane  28.500000  3.389831       FALSE    TRUE
CFFA.arylpolyene        28.500000  3.389831       FALSE    TRUE
CFFA.LASI               58.000000  3.389831       FALSE    TRUE
CFS                     18.666667  3.389831       FALSE    FALSE
HSER                    1.809524   3.389831       FALSE    FALSE
HSER.arylpolyene.CFS    6.375000   3.389831       FALSE    FALSE
lantipeptide            58.000000  3.389831       FALSE    TRUE
nrps                    13.750000  3.389831       FALSE    FALSE
other                   10.800000  3.389831       FALSE    FALSE
PHOSP                   58.000000  3.389831       FALSE    TRUE
PHOSP.CFS.nrps          58.000000  3.389831       FALSE    TRUE
PHOSP.nrps              58.000000  3.389831       FALSE    TRUE
siderophore             7.428571   3.389831       FALSE    FALSE
t1pks                   18.666667  3.389831       FALSE    FALSE
t1pks.nrps              1.950000   3.389831       FALSE    FALSE
terpene                 10.800000  3.389831       FALSE    FALSE
thiopeptide             1.681818   3.389831       FALSE    FALSE
transatpks.nrps         58.000000  3.389831       FALSE    TRUE

So for the remaining part of the analysis, we had only 12 columns and 59 genomes in three classes, described in the tables below.
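
The freqRatio, percentUnique, zeroVar and nzv columns above carry the same names as the metrics reported by the nearZeroVar function of the R package caret. Assuming that this is the tool that was used, with its default cut-offs (frequency ratio above 95/5 and fewer than 10 % unique values), the flags can be re-derived from the number of positive genomes alone. The following is a minimal Python sketch, not the original R script; the function name nzv_metrics is made up for illustration.

```python
def nzv_metrics(n_positive, n_rows=59, freq_cut=95 / 5, unique_cut=10.0):
    """Near-zero-variance metrics for a binary (0/1) column.

    Assumes the column takes both values (it is not constant). The cut-offs
    are caret's documented defaults; that they were used in the original
    analysis is an assumption.
    """
    n_negative = n_rows - n_positive
    most = max(n_positive, n_negative)
    second = min(n_positive, n_negative)
    freq_ratio = most / second            # most common value / second most common
    percent_unique = 100.0 * 2 / n_rows   # two distinct values among n_rows rows
    nzv = freq_ratio > freq_cut and percent_unique < unique_cut
    return freq_ratio, percent_unique, nzv


# Positive counts taken from the column description table.
for name, positives in [("acyl_AA", 2), ("amglyccycl", 1), ("CFS", 56), ("HSER", 21)]:
    ratio, unique, flag = nzv_metrics(positives)
    print(f"{name:12s} freqRatio={ratio:.6f} percentUnique={unique:.6f} nzv={flag}")
```

Under these assumptions, a column is flagged only when fewer than 3 genomes (or more than 56) are positive, which is why CFS and t1pks, with a ratio of 56/3, are kept.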
Description of the 3 classes for the 59 genomes
===============================================
        Num  Count  Percentage
NPA-NP    0      7        12 %
PA-NP     1      9        15 %
PA-P      2     43        73 %

Counts per class for the 12 secondary metabolites
=================================================
                      NPA-NP  PA-NP  PA-P  total
arylpolyene                2      1     2      5
butyrolactone              2      2     1      5
CFS                        7      9    40     56
HSER                       6      9     6     21
HSER.arylpolyene.CFS       1      6     1      8
nrps                       6      8    41     55
other                      2      0     3      5
siderophore                5      8    39     52
t1pks                      0      0     3      3
t1pks.nrps                 4      1    34     39
terpene                    1      4     0      5
thiopeptide                7      9     6     22

Percentages per class for the 12 secondary metabolites
======================================================
                      NPA-NP  PA-NP   PA-P
arylpolyene             29 %   11 %    5 %
butyrolactone           29 %   22 %    2 %
CFS                    100 %  100 %   93 %
HSER                    86 %  100 %   14 %
HSER.arylpolyene.CFS    14 %   67 %    2 %
nrps                    86 %   89 %   95 %
other                   29 %    0 %    7 %
siderophore             71 %   89 %   91 %
t1pks                    0 %    0 %    7 %
t1pks.nrps              57 %   11 %   79 %
terpene                 14 %   44 %    0 %
thiopeptide            100 %  100 %   14 %

Some rows were identical although they belong to different classes, which is a contradiction when trying to discriminate between those classes. Removing the 39 duplicated rows leaves only 20 distinct row profiles.
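
The search for identical rows that carry different class labels can be sketched as follows. This is a minimal Python illustration, not the original R script: the helper names group_profiles and conflicting and the genome labels g1 to g4 are made up. The profile of g1 and g2 is the one shared by the two Erwinia genomes listed further down; g3 and g4 are invented rows.

```python
from collections import defaultdict


def group_profiles(rows):
    """Group genomes by identical 0/1 profile.

    rows: list of (genome_name, class_label, profile) triples.
    Returns a dict mapping each profile to its list of (name, class) pairs.
    """
    groups = defaultdict(list)
    for name, label, profile in rows:
        groups[tuple(profile)].append((name, label))
    return dict(groups)


def conflicting(groups):
    """Keep only the profiles shared by genomes of different classes."""
    return {
        profile: members
        for profile, members in groups.items()
        if len({label for _, label in members}) > 1
    }


# Toy input over the 12 retained columns (hypothetical genome names).
rows = [
    ("g1", 0, (0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1)),
    ("g2", 1, (0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1)),
    ("g3", 2, (0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0)),
    ("g4", 2, (0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0)),
]

groups = group_profiles(rows)
print(len(rows) - len(groups), "duplicated rows,", len(groups), "distinct profiles")
for profile, members in conflicting(groups).items():
    print("conflict:", members)
```

Duplicates within a single class (g3 and g4 here) only reduce the number of profiles, whereas a profile shared across classes (g1 and g2) cannot be separated by any classifier built on these columns.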
Equal lines
===========
Line                    Equals
   1  NPNP-000773975.1
   2  NPNP-001267535.1  3
   4  NPNP-001484765.1
   5  NPNP-002752575.1  8
   6  NPNP-002865965.1
   7  NPNP-900068895.1
   9  NPPA-000196615.1  13 16
  10  NPPA-000336255.1
  11  NPPA-000745075.1
  12  NPPA-000770305.1
  14  NPPA-001422605.1
  15  NPPA-001517405.1
  17  PATH-000026985.1  19
  18  PATH-000027205.1  20 22 23 24 25 26 27 28 29 31 32 33 35 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
  21  PATH-000165815.1  58
  30  PATH-000404125.1  36
  34  PATH-000590885.1
  37  PATH-001050515.1
  38  PATH-001571305.1
  59  PATH-003485445.1

Classes and contradictions between classes
==========================================
       Line              Class  Equal.Lines  Classes
 [1,]  NPNP-000773975.1  0      1            0
 [2,]  NPNP-001267535.1  0      2            0 0
 [3,]  NPNP-001484765.1  0      1            0
 [4,]  NPNP-002752575.1  0      2            0 1
 [5,]  NPNP-002865965.1  0      1            0
 [6,]  NPNP-900068895.1  0      1            0
 [7,]  NPPA-000196615.1  1      3            1 1 1
 [8,]  NPPA-000336255.1  1      1            1
 [9,]  NPPA-000745075.1  1      1            1
[10,]  NPPA-000770305.1  1      1            1
[11,]  NPPA-001422605.1  1      1            1
[12,]  NPPA-001517405.1  1      1            1
[13,]  PATH-000026985.1  2      2            2 2
[14,]  PATH-000027205.1  2      33           2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[15,]  PATH-000165815.1  2      2            2 2
[16,]  PATH-000404125.1  2      2            2 2
[17,]  PATH-000590885.1  2      1            2
[18,]  PATH-001050515.1  2      1            2
[19,]  PATH-001571305.1  2      1            2
[20,]  PATH-003485445.1  2      1            2

Here are the genomes that cause the issue:
                      GCF_002752575.1-Erwinia_sp.|OLMDLW33  GCF_000026185.1-Erwinia_tasmaniensis|Et1_99
Class                 0                                     1
arylpolyene           0                                     0
butyrolactone         1                                     1
CFS                   1                                     1
HSER                  1                                     1
HSER.arylpolyene.CFS  0                                     0
nrps                  1                                     1
other                 0                                     0
siderophore           1                                     1
t1pks                 0                                     0
t1pks.nrps            0                                     0
terpene               0                                     0
thiopeptide           1                                     1

2. Prediction for the P and NP classes
With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain any robust result. Moreover, since the same profile is found in both the NPA-NP class and the PA-NP class, we merged them into a single NP class and tried to predict the two classes NP and P using the 59 genomes.
Description of the 2 classes for the 59 genomes
===============================================
    Num  Count  Percentage
NP    1     16        27 %
P     2     43        73 %

Below are the counts and percentages for these two classes, which we also tried to order:
And the data can be found here (class 1 is NP, class 2 is P):
No single column obviously separates the NP and P classes.
A graphical representation such as a count heatmap makes it easier to see the data. Columns are ordered so that those with the highest count of positive NP genomes and the lowest count of positive P genomes come first.
Using the three classes:
It may be easier to see the classes with three different graphics:
A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:
Binary Logistic Regression with only one column
===============================================
Column                AUROC      NP      P
thiopeptide           0.9302  100 %   14 %
HSER                  0.8990   94 %   14 %
t1pks.nrps            0.7391   31 %   79 %
HSER.arylpolyene.CFS  0.7071   44 %    2 %
terpene               0.6562   31 %    0 %
butyrolactone         0.6134   25 %    2 %
arylpolyene           0.5705   19 %    5 %
siderophore           0.5472   81 %   91 %
nrps                  0.5392   88 %   95 %
CFS                   0.5349  100 %   93 %
t1pks                 0.5349    0 %    7 %
other                 0.5276   12 %    7 %

A complete multiple binary logistic regression is able to predict the class using 9 variables only:
Coefficients          Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)              28.04   161829.67        0         1
arylpolyene             -49.14   166878.33        0         1
butyrolactone           -97.19   214196.19        0         1
HSER                     99.68   402750.77        0         1
HSER.arylpolyene.CFS    -49.15   145106.92        0         1
nrps                     47.95   140693.62        0         1
siderophore             -49.42   167965.54        0         1
t1pks                    98.05   246528.08        0         1
terpene                 -47.97   141612.77        0         1
thiopeptide            -102.55   398075.87        0         1

A graphical representation of these 9 variables only is helpful:
Using the three classes:
Here again, it may be easier to see the classes with three different graphics:
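
As a cross-check on the single-variable table earlier in this section, the AUROC values can be re-derived by hand. With a single 0/1 predictor, a logistic model can only rank the genomes into two groups, so the ROC curve has one interior point and the area under it equals the mean of sensitivity and specificity. The Python sketch below applies this to the class sizes given above (16 NP and 43 P genomes); the function name auroc_binary is made up for illustration, and this is a check on the reported numbers, not the original R code.

```python
def auroc_binary(pos_np, pos_p, n_np=16, n_p=43):
    """AUROC of a single 0/1 predictor for separating NP from P genomes.

    pos_np, pos_p: number of genomes carrying the metabolite in each class.
    The predictor is oriented in whichever direction discriminates best,
    as a fitted logistic regression does through the sign of its slope.
    """
    rate_np = pos_np / n_np   # fraction of NP genomes that are positive
    rate_p = pos_p / n_p      # fraction of P genomes that are positive
    # Mean of sensitivity and specificity, written for "positive means NP".
    auc = (rate_np + (1.0 - rate_p)) / 2.0
    return max(auc, 1.0 - auc)


# Positive counts per merged class, from the three-class count table
# (NP = NPA-NP + PA-NP).
counts = {
    "thiopeptide": (16, 6),
    "HSER": (15, 6),
    "t1pks.nrps": (5, 34),
    "HSER.arylpolyene.CFS": (7, 1),
}
for name, (pos_np, pos_p) in counts.items():
    print(f"{name:22s} AUROC = {auroc_binary(pos_np, pos_p):.4f}")
```

The four values agree with the table (0.9302, 0.8990, 0.7391 and 0.7071), which suggests that the reported AUROC reflects nothing more than the difference in positive rates between the two classes.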
3. Concluding remarks
Although we do not have many genomes in each class, it is possible to build a logistic regression model, using 9 of the 12 retained columns, that is able to predict pathogenicity.