
PREDIPATH predictions using the GWAS-specific k-mers

Table of contents

  1. Preparation and description of the data

  2. Prediction for the P and NP classes

  3. Concluding remarks

1. Preparation and description of the data

The data were provided as a table of 59 rows (genomes) by 512 columns (k-mers). The 59 genomes were annotated NPA-NP, PA-NP or PA-P (NPA: non plant associated, PA: plant associated, NP: non pathogenic, P: pathogenic). Here are some of the column names, which are clearly not self-explanatory:


     512 Columns:
     ============
     
           name
     [1,] "X977035"
     [2,] "X353930"
     [3,] "X660456"
     [4,] "X957389"
     [5,] "X761692"
     [6,] "X1081881"
     [...]
     [507,] "X1096408"
     [508,] "X1126723"
     [509,] "X841270"
     [510,] "X1077409"
     [511,] "X691797"
     [512,] "X1139556"
     

With a small R script, we checked that there were no constant columns.
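
The original check was done in R and is not shown here. As a minimal sketch of the same idea in Python, assuming the table is held as a dictionary mapping each column name to its list of 0/1 values (the column names below are made up for illustration), a constant column is one with a single distinct value:

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical.

    `table` maps a column name to its list of 0/1 values (one per genome).
    """
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Toy presence/absence table with made-up column names (not the real data).
toy_table = {
    "kmer_a": [0, 1, 1, 0],
    "kmer_b": [1, 1, 1, 1],  # constant: present in every genome
    "kmer_c": [0, 0, 1, 0],
}
print(constant_columns(toy_table))
```

On the real data this list would be empty, since no constant column was found.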

With another small R script, we removed 326 columns that were identical to other columns, leaving us with only 186 columns.
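
A possible way to remove duplicated columns is sketched below in Python (the original R script is not shown, so this is an illustration and not the authors' code). It keeps the first column of each group of identical columns. The function name and the toy column names are hypothetical.

```python
def drop_duplicate_columns(table):
    """Keep the first column of each group of identical columns.

    Returns the reduced table and the list of removed column names.
    """
    first_seen = {}  # column profile -> name of the first column with it
    removed = []     # names of columns that repeat an earlier profile
    for name, values in table.items():
        profile = tuple(values)
        if profile in first_seen:
            removed.append(name)
        else:
            first_seen[profile] = name
    kept = {name: table[name] for name in first_seen.values()}
    return kept, removed


# Toy table with made-up column names (not the real data).
toy_columns = {
    "kmer_a": [0, 1, 1, 0],
    "kmer_b": [0, 1, 1, 0],  # same profile as kmer_a
    "kmer_c": [1, 0, 0, 0],
}
kept, removed = drop_duplicate_columns(toy_columns)
print(len(kept), "columns kept,", len(removed), "removed")
```

With the figures reported here, 512 - 326 = 186 columns are kept.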

Among these remaining columns, we removed 39 columns with near-zero variance:


           names2rm
      [1,] "X15477"
      [2,] "X45477"
      [3,] "X54770"
      [4,] "X54771"
      [5,] "X54776"
      [6,] "X54777"
      [7,] "X54778"
      [8,] "X54779"
      [9,] "X55477"
     [10,] "X75477"
     [11,] "X85477"
     [12,] "X95477"
     [13,] "X105477"
     [14,] "X115477"
     [15,] "X125477"
     [16,] "X145477"
     [17,] "X154770"
     [18,] "X154777"
     [19,] "X154778"
     [20,] "X165477"
     [21,] "X254771"
     [22,] "X254774"
     [23,] "X254777"
     [24,] "X315477"
     [25,] "X454772"
     [26,] "X454773"
     [27,] "X454778"
     [28,] "X525477"
     [29,] "X547713"
     [30,] "X547741"
     [31,] "X547744"
     [32,] "X547745"
     [33,] "X547759"
     [34,] "X547799"
     [35,] "X625477"
     [36,] "X654771"
     [37,] "X954777"
     [38,] "X954778"
     [39,] "X1054771"
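
The exact near-zero variance criterion used for this step is not stated. As a rough sketch only, assuming a plain cutoff on the population variance of each 0/1 column (the `threshold` value of 0.05 and the column names are hypothetical), such a filter could look like this in Python:

```python
def near_zero_variance_columns(table, threshold=0.05):
    """Return names of columns whose population variance is below `threshold`.

    `threshold` is a hypothetical cutoff chosen for this sketch only;
    it is not the criterion used in the original analysis.
    """
    flagged = []
    for name, values in table.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance < threshold:
            flagged.append(name)
    return flagged


# Toy columns for 59 genomes, with made-up names (not the real data).
toy_kmers = {
    "kmer_rare": [1] + [0] * 58,       # present in a single genome
    "kmer_common": [0, 1] * 29 + [0],  # present in 29 genomes out of 59
}
print(near_zero_variance_columns(toy_kmers))
```

For a 0/1 column present in a proportion p of the genomes, the variance is p(1 - p), so this cutoff flags k-mers that are present (or absent) in only a few genomes.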
     

So for the rest of the analysis, we had only 147 columns and 59 genomes in three classes.

We tried to order the k-mers by their percentage of presence in the three classes, and also by their counts of presence in the three classes.

Some rows were identical yet belonged to distinct classes, which is a contradiction when trying to discriminate between the classes. After removing the 33 rows that duplicated another row, only 26 distinct row profiles remain.


     Detection of equal lines and contradictions
     ===========================================
     
     33 lines removed
        Line              Equals
     1  NPNP-000773975.1
     2  NPNP-001267535.1  3
     4  NPNP-001484765.1
     5  NPNP-002752575.1
     6  NPNP-002865965.1  14
     7  NPNP-900068895.1
     8  NPPA-000026185.1
     9  NPPA-000196615.1
     10 NPPA-000336255.1
     11 NPPA-000745075.1
     12 NPPA-000770305.1
     13 NPPA-001269445.1
     15 NPPA-001517405.1
     16 NPPA-002980095.1
     17 PATH-000026985.1  19 58
     18 PATH-000027205.1  20 22 23 24 25 26 27 28 31 32 33 35 42 43 44 47 51 56 57
     21 PATH-000165815.1
     29 PATH-000367665.1
     30 PATH-000404125.1  36
     34 PATH-000590885.1
     37 PATH-001050515.1
     38 PATH-001571305.1  59
     39 PATH-002732125.1  40 41 46 49 50 53 55
     45 PATH-002732285.1  52
     48 PATH-002732335.1
     54 PATH-002732445.1
     
     Classes and contradictions between classes
     ==========================================
     
           Line             Class Equal.Lines Classes
      [1,] NPNP-000773975.1 0     1           0
      [2,] NPNP-001267535.1 0     2           0 0
      [3,] NPNP-001484765.1 0     1           0
      [4,] NPNP-002752575.1 0     1           0
      [5,] NPNP-002865965.1 0     2           0 1
      [6,] NPNP-900068895.1 0     1           0
      [7,] NPPA-000026185.1 1     1           1
      [8,] NPPA-000196615.1 1     1           1
      [9,] NPPA-000336255.1 1     1           1
     [10,] NPPA-000745075.1 1     1           1
     [11,] NPPA-000770305.1 1     1           1
     [12,] NPPA-001269445.1 1     1           1
     [13,] NPPA-001517405.1 1     1           1
     [14,] NPPA-002980095.1 1     1           1
     [15,] PATH-000026985.1 2     3           2 2 2
     [16,] PATH-000027205.1 2     20          2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     [17,] PATH-000165815.1 2     1           2
     [18,] PATH-000367665.1 2     1           2
     [19,] PATH-000404125.1 2     2           2 2
     [20,] PATH-000590885.1 2     1           2
     [21,] PATH-001050515.1 2     1           2
     [22,] PATH-001571305.1 2     2           2 2
     [23,] PATH-002732125.1 2     8           2 2 2 2 2 2 2 2
     [24,] PATH-002732285.1 2     2           2 2
     [25,] PATH-002732335.1 2     1           2
     [26,] PATH-002732445.1 2     1           2
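
The detection of equal lines and of contradictions between classes can be sketched as follows in Python. This is an illustration under assumptions, not the original R script: each genome is assumed to be a (name, class, profile) tuple, and the genome names below are made up.

```python
from collections import defaultdict


def group_equal_lines(rows):
    """Group genomes that share exactly the same presence/absence profile.

    `rows` is a list of (genome_name, class_label, profile) tuples.
    Returns a list of groups, each a list of (genome_name, class_label).
    """
    groups = defaultdict(list)
    for name, label, profile in rows:
        groups[tuple(profile)].append((name, label))
    return list(groups.values())


def contradictions(groups):
    """Return the groups whose identical profiles span several classes."""
    return [g for g in groups if len({label for _, label in g}) > 1]


# Toy data with made-up genome names (not the real data).
toy_rows = [
    ("genome_1", 0, [1, 0, 1]),
    ("genome_2", 0, [1, 0, 1]),  # same profile, same class: only redundant
    ("genome_3", 0, [0, 1, 1]),
    ("genome_4", 1, [0, 1, 1]),  # same profile as genome_3, other class
    ("genome_5", 2, [0, 0, 0]),
]
groups = group_equal_lines(toy_rows)
lines_removed = len(toy_rows) - len(groups)
print(lines_removed, "lines removed,", len(groups), "profiles remain")
print(contradictions(groups))
```

With the figures reported above, 59 - 33 = 26 profiles remain, and a single group mixes two classes.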
     

Here are the two genomes at issue, which have identical profiles but belong to different classes:


                                    Class  X353930 X660456 X957389 X761692 X1081881 X1125434 X1026271 X626933 X868767 X164361 X960658 X1046571 X796993 X400739 X577828 X688541 X1111828 X907054 X745798 X416115 X668025 X1012033 X1140694 X624647 X1119070 X388437 X1106871 X777169 X508965 X704021 X372564 X1107946 X606155 X948978 X1033842 X231764 X780124 X1085719 X263637 X1126485 X1081844 X1085951 X424521 X567309 X807761 X1097537 X1009922 X5477 X25477 X35477 X54775 X135477 X154773 X154776 X154779 X215477 X305477 X385477 X454779 X547715 X547729 X547738 X547757 X547770 X645477 X654772 X755477 X765477 X854770 X854772 X854777 X915477 X925477 X935477 X954771 X954772 X954776 X965477 X975477 X985477 X1005477 X1025477 X1045477 X1054770 X1054772 X1054773 X1054774 X1054777 X1065477 X1075477 X1085477 X1095477 X1115477 X1125477 X1137966 X393704 X279468 X602772 X1032810 X489606 X427449 X1122208 X1081410 X424856 X1059101 X859226 X1034797 X774086 X912369 X777697 X427629 X347777 X895901
     002865965.1-Erwinia_sp.|B116        0       1       1       0       0        1        0        1       1       0       0       0        1       0       0       1       1        0       1       1       0       0        0        1       1        0       0        1       1       1       1       1        0       1       0        0       0       1        0       0        1        0        0       1       1       1        0        0     0      0      0      0       1       1       1       1       1       0       0       1       0       1       0       1       0       0       0       0       1       0       0       0       0       1       0       0       0       0       0       0       1        1        0        0        1        0        0        0        1        0        1        0        0        0        0        1       0       0       0        0       0       0        1        1       0        1       1        1       1       0       0       0       1       0
     001422605.1-Erwinia_sp.|Leaf53      1       1       1       0       0        1        0        1       1       0       0       0        1       0       0       1       1        0       1       1       0       0        0        1       1        0       0        1       1       1       1       1        0       1       0        0       0       1        0       0        1        0        0       1       1       1        0        0     0      0      0      0       1       1       1       1       1       0       0       1       0       1       0       1       0       0       0       0       1       0       0       0       0       1       0       0       0       0       0       0       1        1        0        0        1        0        0        0        1        0        1        0        0        0        0        1       0       0       0        0       0       0        1        1       0        1       1        1       1       0       0       0       1       0
     

2. Prediction for the P and NP classes

With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain any robust result. Moreover, since the same profile is found in both the NPA-NP class and the PA-NP class, we merged them into a single NP class and tried to predict the two classes NP and P using the 59 genomes.


     Description of the 2 classes for the 59 genomes
     ===============================================
     
              Num    Count  Percentage
           NP   1       16        27 %
           P    2       43        73 %
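
The merge and the resulting class sizes can be checked with a few lines of Python. This is only an illustrative sketch: the label list is rebuilt from the class sizes given in this report (7, 9 and 43 genomes), and the variable names are made up.

```python
from collections import Counter

# Mapping from the three original annotations to the two merged classes.
MERGE = {"NPA-NP": "NP", "PA-NP": "NP", "PA-P": "P"}

# Labels rebuilt from the class sizes given in this report.
labels3 = ["NPA-NP"] * 7 + ["PA-NP"] * 9 + ["PA-P"] * 43
labels2 = [MERGE[label] for label in labels3]

counts = Counter(labels2)
for name in ("NP", "P"):
    percent = round(100 * counts[name] / len(labels2))
    print(name, counts[name], f"{percent} %")
```

The sketch gives 16 NP genomes (27 %) and 43 P genomes (73 %), in agreement with the table above.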
     

We also computed the counts and percentages of presence for these two classes, and tried to order them.

There is an obvious solution to our problem here: k-mer X774086 is present in 100% of the NP genomes and absent from 100% of the P genomes.

Another k-mer of interest is X1126723, which is present in 100% of the NP genomes and absent from 98% of the P genomes (it is present in only 1 of the 43 P genomes).
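
To make this kind of ordering concrete, here is a minimal Python sketch that computes the percentage of presence of each k-mer in the NP and P classes and orders the k-mers by the gap between the two. The ordering criterion, the function names and the toy k-mers are assumptions made for illustration. Only the class sizes (16 NP, 43 P) come from this report.

```python
def presence_percent(values, labels, target):
    """Percentage of genomes of class `target` in which the k-mer is present."""
    in_class = [v for v, label in zip(values, labels) if label == target]
    return 100.0 * sum(in_class) / len(in_class)


def rank_kmers(table, labels):
    """Order k-mers by decreasing |% present in NP - % present in P|."""
    scored = []
    for name, values in table.items():
        np_pct = presence_percent(values, labels, "NP")
        p_pct = presence_percent(values, labels, "P")
        scored.append((name, np_pct, p_pct))
    return sorted(scored, key=lambda s: abs(s[1] - s[2]), reverse=True)


# Class sizes from this report; the k-mer columns are made up.
labels = ["NP"] * 16 + ["P"] * 43
toy_kmers = {
    "kmer_noise": [i % 2 for i in range(59)],
    "kmer_near": [1] * 16 + [1] + [0] * 42,  # all NP, plus a single P genome
    "kmer_perfect": [1] * 16 + [0] * 43,     # all NP, no P genome
}
ranked = rank_kmers(toy_kmers, labels)
for name, np_pct, p_pct in ranked:
    print(f"{name}: {np_pct:.0f}% present in NP, {100 - p_pct:.0f}% absent in P")
```

A k-mer present in a single P genome out of 43 is absent from 42/43 of them, about 97.7%, which is the 98% quoted above after rounding.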

3. Concluding remarks

Although we do not have many genomes in each class, it is possible to find a specific k-mer (X774086) that is always present in the non-pathogenic genomes and never present in the pathogenic ones.

 

 
