PREDIPATH predictions using the secondary metabolites data

Table of contents

 1. Preparation and description of the data

 2. Prediction for the P and NP classes

 3. Concluding remarks

1. Preparation and description of the data
The data were provided as a table of 67 rows by 30 columns. Column 13 was removed because it was named cf_putative. We kept only the 59 genomes annotated NPA-NP, PA-NP or PA-P (NPA: non plant associated, PA: plant associated, NP: non pathogenic, P: pathogenic). The columns are described below:
 NAME OF THE COLUMN number of positive values sum
 acyl_amino_acids 2 2
 amglyccycl 1 1
 arylpolyene 5 5
 arylpolyene.cf_saccharide 2 2
 bacteriocin 2 2
 butyrolactone 5 5
 butyrolactone.cf_saccharide 1 1
 cf_fatty_acid 59 132
 cf_fatty_acid.acyl_amino_acids.ladderane 2 2
 cf_fatty_acid.arylpolyene 2 2
 cf_fatty_acid.butyrolactone 1 1
 cf_fatty_acid.ladderane.siderophore 1 1
 cf_saccharide 56 134
 cyanobactin 0 0
 hserlactone 21 26
 hserlactone.arylpolyene.cf_saccharide 8 8
 hserlactone.cf_fatty_acid 1 1
 lantipeptide 1 1
 nrps 55 83
 other 5 6
 phosphonate 1 1
 phosphonate.cf_saccharide.nrps 1 1
 phosphonate.nrps 1 1
 siderophore 52 55
 t1pks 3 3
 t1pks.nrps 39 40
 terpene 5 5
 thiopeptide 22 22
 transatpks.nrps 1 1
 
We then decided to shorten the names of some columns. Here is the new description of the columns:
 Number NAME SUM
 01 acyl_AA 2
 02 amglyccycl 1
 03 arylpolyene 5
 04 arylpolyene.CFS 2
 05 bacteriocin 2
 06 butyrolactone 5
 07 butyrolactone.CFS 1
 08 CFFA 59
 09 CFFA.acyl_AA.ladderane 2
 10 CFFA.arylpolyene 2
 11 CFFA.butyrolactone 1
 12 CFFA.LASI 1
 13 CFS 56
 14 cyanobactin 0
 15 HSER 21
 16 HSER.arylpolyene.CFS 8
 17 HSER.CFFA 1
 18 lantipeptide 1
 19 nrps 55
 20 other 5
 21 PHOSP 1
 22 PHOSP.CFS.nrps 1
 23 PHOSP.nrps 1
 24 siderophore 52
 25 t1pks 3
 26 t1pks.nrps 39
 27 terpene 5
 28 thiopeptide 22
 29 transatpks.nrps 1
 
A small R script identified the 2 constant columns (CFFA, cyanobactin), which were removed:
 Column Const distinctVals keep/force
 1 acyl_AA No 2 YES
 2 amglyccycl No 2 YES
 3 arylpolyene No 2 YES
 4 arylpolyene.CFS No 2 YES
 5 bacteriocin No 2 YES
 6 butyrolactone No 2 YES
 7 butyrolactone.CFS No 2 YES
 8 CFFA Yes 1
 9 CFFA.acyl_AA.ladderane No 2 YES
 10 CFFA.arylpolyene No 2 YES
 11 CFFA.butyrolactone No 2 YES
 12 CFFA.LASI No 2 YES
 13 CFS No 2 YES
 14 cyanobactin Yes 1
 15 HSER No 2 YES
 16 HSER.arylpolyene.CFS No 2 YES
 17 HSER.CFFA No 2 YES
 18 lantipeptide No 2 YES
 19 nrps No 2 YES
 20 other No 2 YES
 21 PHOSP No 2 YES
 22 PHOSP.CFS.nrps No 2 YES
 23 PHOSP.nrps No 2 YES
 24 siderophore No 2 YES
 25 t1pks No 2 YES
 26 t1pks.nrps No 2 YES
 27 terpene No 2 YES
 28 thiopeptide No 2 YES
 29 transatpks.nrps No 2 YES
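
The R script itself is not shown. As an illustration only, the same check can be sketched in Python: a column is constant when it holds a single distinct value. The toy table and the function name `split_constant_columns` are hypothetical, not taken from the original analysis.

```python
# Hypothetical toy table: column name -> 0/1 values, one value per genome.
table = {
    "acyl_AA":     [0, 1, 0, 0],
    "CFFA":        [1, 1, 1, 1],   # present in every genome: constant
    "cyanobactin": [0, 0, 0, 0],   # absent from every genome: constant
    "nrps":        [1, 1, 0, 1],
}


def split_constant_columns(columns):
    """Return (kept, dropped); dropped lists the columns with one distinct value."""
    kept, dropped = {}, []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


kept, dropped = split_constant_columns(table)
print("dropped:", dropped)
print("kept:", list(kept))
```

A constant column carries no information for any classifier, whatever its value, which is why CFFA (always present) is dropped along with cyanobactin (always absent).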
 
Another small R script identified 2 more columns that were equal to an earlier column and therefore redundant (CFFA.butyrolactone, HSER.CFFA); they were removed:
 Column Equal Opposite
 1 acyl_AA
 2 amglyccycl
 3 arylpolyene
 4 arylpolyene.CFS
 5 bacteriocin
 6 butyrolactone
 7 butyrolactone.CFS CFFA.butyrolactone
 8 CFFA.acyl_AA.ladderane
 9 CFFA.arylpolyene
 11 CFFA.LASI HSER.CFFA
 12 CFS
 13 HSER
 14 HSER.arylpolyene.CFS
 16 lantipeptide
 17 nrps
 18 other
 19 PHOSP
 20 PHOSP.CFS.nrps
 21 PHOSP.nrps
 22 siderophore
 23 t1pks
 24 t1pks.nrps
 25 terpene
 26 thiopeptide
 27 transatpks.nrps
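
As a hedged sketch of what such a redundancy check might look like (in Python, not the original R script), each column can be compared with the columns that follow it: a later column is redundant if it is equal to the earlier one or is its exact 0/1 complement. The function name and the toy values are assumptions; `not_HSER` is an invented column used only to show the "Opposite" case.

```python
def find_redundant_columns(columns):
    """For each kept column, list the later columns equal or opposite to it."""
    names = list(columns)
    redundant = set()
    report = {}
    for i, first in enumerate(names):
        if first in redundant:
            continue  # already explained by an earlier column
        equal, opposite = [], []
        for second in names[i + 1:]:
            if second in redundant:
                continue
            a, b = columns[first], columns[second]
            if a == b:
                equal.append(second)
            elif all(x == 1 - y for x, y in zip(a, b)):
                opposite.append(second)
        redundant.update(equal + opposite)
        report[first] = {"equal": equal, "opposite": opposite}
    return report, sorted(redundant)


# Hypothetical toy values (4 genomes); only the column names echo the table above.
table = {
    "butyrolactone.CFS":  [0, 0, 1, 0],
    "CFFA.butyrolactone": [0, 0, 1, 0],  # equal to butyrolactone.CFS
    "HSER":               [1, 0, 1, 1],
    "not_HSER":           [0, 1, 0, 0],  # invented complement of HSER
}
report, redundant = find_redundant_columns(table)
print(report)
print("redundant:", redundant)
```

Only the first column of each group is kept, which matches the layout of the table above where the redundant column is listed next to the column it duplicates.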
 
Of the remaining 25 columns, we removed 13 columns with a near zero variance:
 VARIABLE freqRatio percentUnique zeroVar nzv
 acyl_AA 28.500000 3.389831 FALSE TRUE
 amglyccycl 58.000000 3.389831 FALSE TRUE
 arylpolyene 10.800000 3.389831 FALSE FALSE
 arylpolyene.CFS 28.500000 3.389831 FALSE TRUE
 bacteriocin 28.500000 3.389831 FALSE TRUE
 butyrolactone 10.800000 3.389831 FALSE FALSE
 butyrolactone.CFS 58.000000 3.389831 FALSE TRUE
 CFFA.acyl_AA.ladderane 28.500000 3.389831 FALSE TRUE
 CFFA.arylpolyene 28.500000 3.389831 FALSE TRUE
 CFFA.LASI 58.000000 3.389831 FALSE TRUE
 CFS 18.666667 3.389831 FALSE FALSE
 HSER 1.809524 3.389831 FALSE FALSE
 HSER.arylpolyene.CFS 6.375000 3.389831 FALSE FALSE
 lantipeptide 58.000000 3.389831 FALSE TRUE
 nrps 13.750000 3.389831 FALSE FALSE
 other 10.800000 3.389831 FALSE FALSE
 PHOSP 58.000000 3.389831 FALSE TRUE
 PHOSP.CFS.nrps 58.000000 3.389831 FALSE TRUE
 PHOSP.nrps 58.000000 3.389831 FALSE TRUE
 siderophore 7.428571 3.389831 FALSE FALSE
 t1pks 18.666667 3.389831 FALSE FALSE
 t1pks.nrps 1.950000 3.389831 FALSE FALSE
 terpene 10.800000 3.389831 FALSE FALSE
 thiopeptide 1.681818 3.389831 FALSE FALSE
 transatpks.nrps 58.000000 3.389831 FALSE TRUE
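
The metric names above (freqRatio, percentUnique, zeroVar, nzv) appear to match those reported by the `nearZeroVar` function of the R package caret. Assuming its default cutoffs (freqCut = 95/5 and uniqueCut = 10), the following Python sketch reproduces the rule from the number of positive values alone. It is an illustration under that assumption, not the script actually used.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Compute freqRatio, percentUnique, zeroVar and nzv for one column."""
    counts = [count for _, count in Counter(values).most_common()]
    zero_var = len(counts) < 2
    # Ratio of the most frequent value to the second most frequent one.
    freq_ratio = 0.0 if zero_var else counts[0] / counts[1]
    percent_unique = 100.0 * len(counts) / len(values)
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {
        "freqRatio": freq_ratio,
        "percentUnique": percent_unique,
        "zeroVar": zero_var,
        "nzv": nzv,
    }


# acyl_AA: 2 positive genomes out of 59; CFS: 56 positive genomes out of 59.
acyl_aa = near_zero_variance([1] * 2 + [0] * 57)
cfs = near_zero_variance([1] * 56 + [0] * 3)
print(acyl_aa)
print(cfs)
```

With 59 genomes, a 0/1 column is flagged as soon as its rarer value occurs at most twice (57/2 = 28.5 exceeds 19), while three occurrences (56/3, about 18.67) stay just under the cutoff. This is consistent with CFS and t1pks being kept in the table above.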
 
So for the remaining part of the analysis, we had only 12 columns and 59 genomes in three classes:
 
 Description of the 3 classes for the 59 genomes
 ===============================================
 
 Num Count Percentage
 NPA-NP 0 7 12 %
 PA-NP 1 9 15 %
 PA-P 2 43 73 %
 
 Counts per class for the 12 secondary metabolites
 =================================================
 
 NPA-NP PA-NP PA-P total
 arylpolyene 2 1 2 5
 butyrolactone 2 2 1 5
 CFS 7 9 40 56
 HSER 6 9 6 21
 HSER.arylpolyene.CFS 1 6 1 8
 nrps 6 8 41 55
 other 2 0 3 5
 siderophore 5 8 39 52
 t1pks 0 0 3 3
 t1pks.nrps 4 1 34 39
 terpene 1 4 0 5
 thiopeptide 7 9 6 22
 
 Percentages per class for the 12 secondary metabolites
 ======================================================
 
 NPA-NP PA-NP PA-P
 arylpolyene 29 % 11 % 5 %
 butyrolactone 29 % 22 % 2 %
 CFS 100 % 100 % 93 %
 HSER 86 % 100 % 14 %
 HSER.arylpolyene.CFS 14 % 67 % 2 %
 nrps 86 % 89 % 95 %
 other 29 % 0 % 7 %
 siderophore 71 % 89 % 91 %
 t1pks 0 % 0 % 7 %
 t1pks.nrps 57 % 11 % 79 %
 terpene 14 % 44 % 0 %
 thiopeptide 100 % 100 % 14 %
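
The percentages above are the positive counts divided by the class sizes (7, 9 and 43). A minimal Python sketch for one column is given below. The function name `class_table` is hypothetical, and which genomes are set to 1 inside each class is arbitrary; only the per-class totals for HSER come from the tables above.

```python
from collections import Counter


def class_table(classes, column):
    """Return {class: (positive count, rounded percentage of the class size)}."""
    sizes = Counter(classes)
    positives = Counter(c for c, v in zip(classes, column) if v == 1)
    return {c: (positives[c], round(100 * positives[c] / sizes[c])) for c in sizes}


# Class sizes from the description above: 7 NPA-NP, 9 PA-NP, 43 PA-P.
classes = ["NPA-NP"] * 7 + ["PA-NP"] * 9 + ["PA-P"] * 43
# HSER: 6, 9 and 6 positive genomes respectively (order within a class is arbitrary).
hser = [1] * 6 + [0] * 1 + [1] * 9 + [1] * 6 + [0] * 37

hser_summary = class_table(classes, hser)
print(hser_summary)
```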
 
Some lines (genomes) had identical profiles, sometimes with distinct classes, which makes it impossible to discriminate between those classes. After removing the 39 duplicate lines, only 20 distinct line profiles remain.
 Equal lines
 ===========
 
 Line Equals
 1 NPNP-000773975.1
 2 NPNP-001267535.1 3
 4 NPNP-001484765.1
 5 NPNP-002752575.1 8
 6 NPNP-002865965.1
 7 NPNP-900068895.1
 9 NPPA-000196615.1 13 16
 10 NPPA-000336255.1
 11 NPPA-000745075.1
 12 NPPA-000770305.1
 14 NPPA-001422605.1
 15 NPPA-001517405.1
 17 PATH-000026985.1 19
 18 PATH-000027205.1 20 22 23 24 25 26 27 28 29 31 32 33 35 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 21 PATH-000165815.1 58
 30 PATH-000404125.1 36
 34 PATH-000590885.1
 37 PATH-001050515.1
 38 PATH-001571305.1
 59 PATH-003485445.1
 
 Classes and contradictions between classes
 ==========================================
 
 Line Class Equal.Lines Classes
 [1,] NPNP-000773975.1 0 1 0
 [2,] NPNP-001267535.1 0 2 0 0
 [3,] NPNP-001484765.1 0 1 0
 [4,] NPNP-002752575.1 0 2 0 1
 [5,] NPNP-002865965.1 0 1 0
 [6,] NPNP-900068895.1 0 1 0
 [7,] NPPA-000196615.1 1 3 1 1 1
 [8,] NPPA-000336255.1 1 1 1
 [9,] NPPA-000745075.1 1 1 1
 [10,] NPPA-000770305.1 1 1 1
 [11,] NPPA-001422605.1 1 1 1
 [12,] NPPA-001517405.1 1 1 1
 [13,] PATH-000026985.1 2 2 2 2
 [14,] PATH-000027205.1 2 33 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [15,] PATH-000165815.1 2 2 2 2
 [16,] PATH-000404125.1 2 2 2 2
 [17,] PATH-000590885.1 2 1 2
 [18,] PATH-001050515.1 2 1 2
 [19,] PATH-001571305.1 2 1 2
 [20,] PATH-003485445.1 2 1 2
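
One possible way to detect equal lines and class contradictions, sketched in Python under the assumption that each genome is a (name, class, profile) triple, is to group the genomes by profile and flag the groups that contain more than one class. The function names are hypothetical. The four genomes and their profiles are taken from the data listed in section 2, with classes numbered 0, 1 and 2 as above.

```python
from collections import defaultdict


def group_profiles(rows):
    """Group (name, class, profile) rows by identical profile."""
    groups = defaultdict(list)
    for name, label, profile in rows:
        groups[tuple(profile)].append((name, label))
    return groups


def contradictions(groups):
    """Keep only the groups whose members do not all share the same class."""
    return [
        members
        for members in groups.values()
        if len({label for _, label in members}) > 1
    ]


rows = [
    ("Erwinia_sp.|OLMDLW33",        0, (0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1)),
    ("Erwinia_tasmaniensis|Et1_99", 1, (0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1)),
    ("Erwinia_amylovora|E-2",       2, (0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0)),
    ("Erwinia_amylovora|OR1",       2, (0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0)),
]
groups = group_profiles(rows)
conflicts = contradictions(groups)
print(len(groups), "distinct profiles")
print(conflicts)
```

Equal lines within one class are only duplicates. Equal lines across classes are the problematic case, because no model based on these columns can assign them different classes.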
 
Here are the two genomes that share the same profile but belong to different classes:
 Class arylpolyene butyrolactone CFS HSER HSER.arylpolyene.CFS nrps other siderophore t1pks t1pks.nrps terpene thiopeptide
 GCF_002752575.1-Erwinia_sp.|OLMDLW33 0 0 1 1 1 0 1 0 1 0 0 0 1
 GCF_000026185.1-Erwinia_tasmaniensis|Et1_99 1 0 1 1 1 0 1 0 1 0 0 0 1
 
2. Prediction for the P and NP classes
With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain any robust result. Since, in addition, the same profile is found in both the NPA-NP class and the PA-NP class, we merged them into a single NP class and tried to predict the two classes NP and P using the 59 genomes.
 Description of the 2 classes for the 59 genomes
 ===============================================
 
 Num Count Percentage
 NP 1 16 27 %
 P 2 43 73 %
 
Below are the counts and percentages for these two classes, sorted in decreasing order:

 Decreasing global counts for the 12 secondary metabolites data
 ===============================================================

 NP P total
 CFS 16 40 56
 nrps 14 41 55
 siderophore 13 39 52
 t1pks.nrps 5 34 39
 thiopeptide 16 6 22
 HSER 15 6 21
 HSER.arylpolyene.CFS 7 1 8
 arylpolyene 3 2 5
 butyrolactone 4 1 5
 other 2 3 5
 terpene 5 0 5
 t1pks 0 3 3

 Decreasing percentages for the 12 secondary metabolites data
 =============================================================

 NP P
 thiopeptide 100 % 14 %
 CFS 100 % 93 %
 HSER 94 % 14 %
 nrps 88 % 95 %
 siderophore 81 % 91 %
 HSER.arylpolyene.CFS 44 % 2 %
 terpene 31 % 0 %
 t1pks.nrps 31 % 79 %
 butyrolactone 25 % 2 %
 arylpolyene 19 % 5 %
 other 12 % 7 %
 t1pks 0 % 7 %
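
These per-class rates also indicate how well a single 0/1 column separates NP from P. For a score that takes only two values, the area under the ROC curve (with ties counted as one half) reduces to 0.5 plus half the absolute difference between the two positive rates. The Python sketch below applies this identity to the counts above. The function name is hypothetical, and the sketch does not fit any regression. It only illustrates why the single-column AUROC values reported later in this section follow directly from the counts.

```python
def single_column_auroc(pos_np, n_np, pos_p, n_p):
    """AUROC of a classifier based on one 0/1 column, ties counted as 1/2."""
    rate_np = pos_np / n_np   # fraction of NP genomes with the metabolite
    rate_p = pos_p / n_p      # fraction of P genomes with the metabolite
    return 0.5 + abs(rate_np - rate_p) / 2


# Counts from the table above: (positive NP, positive P), with 16 NP and 43 P genomes.
counts = {
    "thiopeptide": (16, 6),
    "HSER": (15, 6),
    "t1pks.nrps": (5, 34),
}
aurocs = {
    name: round(single_column_auroc(pos_np, 16, pos_p, 43), 4)
    for name, (pos_np, pos_p) in counts.items()
}
print(aurocs)
```

The absolute value reflects the fact that a column can be informative in either direction: thiopeptide is frequent in NP genomes, whereas t1pks.nrps is frequent in P genomes.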

The data are given below (class 1 is NP, class 2 is P):

 Class Species arylpolyene butyrolactone CFS HSER HSER.arylpolyene.CFS nrps other siderophore t1pks t1pks.nrps terpene thiopeptide
 1 Erwinia_sp.|9145 0 0 1 1 1 0 0 1 0 0 1 1
 1 Erwinia_sp.|B116 0 1 1 1 1 1 0 0 0 0 0 1
 1 Erwinia_sp.|ErVv1 0 0 1 1 0 1 0 1 0 1 1 1
 1 Erwinia_sp.|Leaf53 0 1 1 1 0 1 0 0 0 0 0 1
 1 Erwinia_iniecta|B120 1 0 1 1 0 1 1 1 0 1 0 1
 1 Erwinia_iniecta|B149 1 0 1 1 0 1 1 1 0 1 0 1
 1 Erwinia_sp.|OLMDLW33 0 1 1 1 0 1 0 1 0 0 0 1
 1 Erwinia_gerundensis|NA 0 0 1 1 1 1 0 1 0 0 1 1
 1 Erwinia_billingiae|Eb661 0 0 1 1 1 1 0 1 0 0 0 1
 1 Erwinia_oleae|DAPP-PG531 0 0 1 1 1 1 0 1 0 1 1 1
 1 Erwinia_typographi|M043b 0 0 1 1 0 0 0 1 0 1 0 1
 1 Erwinia_billingiae|MYb121 0 0 1 1 1 1 0 1 0 0 0 1
 1 Erwinia_billingiae|OSU19-1 0 0 1 1 1 1 0 1 0 0 0 1
 1 Erwinia_tasmaniensis|Et1_99 0 1 1 1 0 1 0 1 0 0 0 1
 1 Erwinia_toletana|DAPP-PG735 1 0 1 1 0 1 0 1 0 0 1 1
 1 Erwinia_teleogrylli|SCU-B244 0 0 1 0 0 1 0 0 0 0 0 1
 2 Erwinia_amylovora|E-2 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|OR1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|OR6 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_persicina|B64 1 1 0 1 0 1 0 0 1 0 0 1
 2 Erwinia_amylovora|CA3R 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|EA110 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|Ea266 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|Ea356 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|Ea644 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|LA092 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|LA635 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|LA636 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|LA637 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|UT5P4 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|UPN527 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CTBT1-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CTBT3-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|MAGFLF2 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_pyrifoliae|Ejp617 0 0 0 0 0 1 0 1 0 0 0 0
 2 Erwinia_pyrifoliae|Ep1_96 0 0 1 0 0 1 0 1 0 0 0 0
 2 Erwinia_amylovora|01SFR-BO 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|ACW56400 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CFBP1232 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CFBP2585 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CFPB1430 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CTMF03-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|CTST01-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|MANB02-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|NHSB01-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|NHWL02-2 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|VTBL01-1 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|VTDMSF02 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_pyrifoliae|EpK1_15 0 0 0 0 0 1 0 1 0 0 0 0
 2 Erwinia_tracheiphila|PSU-1 0 0 1 1 0 0 1 0 0 0 0 1
 2 Erwinia_amylovora|ATCC49946 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|NBRC12687 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_amylovora|WSDA87-73 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_pyrifoliae|DSM12163 0 0 1 0 0 1 0 1 0 0 0 0
 2 Erwinia_tracheiphila|BuffGH 0 0 1 1 0 0 1 0 0 0 0 1
 2 Erwinia_amylovora|RISTBO01-2 0 0 1 0 0 1 0 1 0 1 0 0
 2 Erwinia_mallotivora|BT-MARDI 0 0 1 1 1 1 1 1 1 0 0 1
 2 Erwinia_persicina|NBRC102418 1 0 1 1 0 1 0 0 1 0 0 1
 2 Erwinia_piriflorinigrans|CFBP5888 0 0 1 1 0 1 0 1 0 1 0 1

There is no obvious solution to the prediction problem. A graphical representation such as a count heatmap makes the data easier to read. Columns are ordered so that those with the most positive NP genomes and the fewest positive P genomes come first.
 [Figure: ../Tables-oct/ch03-01.png]

Using the three classes:

 [Figure: ../Tables-oct/ch03-00.png]

It may be easier to see the classes with three different graphics:

 [Figure: ../Tables-oct/pre02NPNP.png]
 [Figure: ../Tables-oct/pre02NPPA.png]
 [Figure: ../Tables-oct/pre02PATH.png]

A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:

 Binary Logistic Regression with only one column
 ===============================================

 Column AUROC NP P
 thiopeptide 0.9302 100 % 14 %
 HSER 0.8990 94 % 14 %
 t1pks.nrps 0.7391 31 % 79 %
 HSER.arylpolyene.CFS 0.7071 44 % 2 %
 terpene 0.6562 31 % 0 %
 butyrolactone 0.6134 25 % 2 %
 arylpolyene 0.5705 19 % 5 %
 siderophore 0.5472 81 % 91 %
 nrps 0.5392 88 % 95 %
 CFS 0.5349 100 % 93 %
 t1pks 0.5349 0 % 7 %
 other 0.5276 12 % 7 %

A complete multiple binary logistic regression is able to predict the class using only 9 variables:

 Coefficients Estimate Std. Error z value Pr(>|z|)
 (Intercept) 28.04 161829.67 0 1
 arylpolyene -49.14 166878.33 0 1
 butyrolactone -97.19 214196.19 0 1
 HSER 99.68 402750.77 0 1
 HSER.arylpolyene.CFS -49.15 145106.92 0 1
 nrps 47.95 140693.62 0 1
 siderophore -49.42 167965.54 0 1
 t1pks 98.05 246528.08 0 1
 terpene -47.97 141612.77 0 1
 thiopeptide -102.55 398075.87 0 1

The very large standard errors and the p-values of 1 are typical of a complete separation of the two classes, so the individual coefficients should not be interpreted on their own.

A graphical representation restricted to these 9 variables is helpful:

 [Figure: ../Tables-oct/ch03-02.png]

Using the three classes:

 [Figure: ../Tables-oct/ch03-03.png]

Here again, it may be easier to see the classes with three different graphics:

 [Figure: ../Tables-oct/pre02NPNP9.png]
 [Figure: ../Tables-oct/pre02NPPA9.png]
 [Figure: ../Tables-oct/pre02PATH9.png]

3. Concluding remarks
Although we do not have many genomes in each class, it is possible to build a logistic regression model that predicts pathogenicity on these data.
