
Review exercise for session 1

 
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php
 

          


Try to read the DIABETE data into a data.frame that keeps only the columns id, glyhb, age, gender, height, weight, waist and hip, then convert the measurements to metric units. This is the diabetes dataset, taken from the datasets that Professor F. Harrell provides with his book Regression Modeling Strategies. Watch out for the many missing values.

Next, build a "diagnosed diabetes" variable based on the PubMed abstract pubmed/9258308 (excerpt boxed below). For this exercise you can therefore ignore the variables chol, stab.glu, hdl, ratio, location, frame, bp.1s, bp.1d and time.ppn, then add a waist/hip ratio variable named rach.

Note: if you cannot read the Excel file, you can fall back on the file diabete.dar, which has the same missing values, marked as NA.


     Prevalence of coronary heart disease risk factors among rural blacks: a community-based study
     (diabetes Dataset).
     
     These data are courtesy of Dr John Schorling, Department of Medicine,
     University of Virginia School of Medicine.
     
     The data consist of 19 variables on 403 subjects from 1046 subjects who
     were interviewed in a study to understand the prevalence of obesity, diabetes,
     and other cardiovascular risk factors in central Virginia for African Americans.
     According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is
     associated most strongly with obesity. The waist/hip ratio may be a predictor
     in diabetes and heart disease. DM II is also associated with hypertension
     - they may both be part of "Syndrome X". The 403 subjects were the ones who
     were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually
     taken as a positive diagnosis of diabetes.
     
     Background. Coronary heart disease (CHD) remains the most common cause of
     death among blacks, and the difference in CHD mortality between blacks
     and whites is growing. This trend may be due in part to higher rates of
     CHD risk factors among blacks. This study was done to determine the
     prevalence of CHD risk factors among a population-based sample of 403
     rural blacks in Virginia.
     
     Methods. Community-based screening evaluations included the
     determination of exercise and smoking habits, blood pressure,
     height, weight, total and high-density lipoprotein (HDL)
     cholesterol, and glycosylated hemoglobin.
     
     (C) 1997 Southern Medical Association
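
The diagnosis rule and the waist/hip ratio mentioned in the excerpt can be sketched on a few made-up values. This is only an illustration: the vectors below are hypothetical, not taken from the dataset, and the names diagnosed, ratio_in and ratio_cm are chosen for this sketch.

```r
# Toy values (hypothetical, not taken from the dataset)
glyhb <- c(4.31, 7.72, NA, 7.00)
waist <- c(29, 44, 36, NA)    # inches
hip   <- c(38, 41, 42, 40)    # inches

# Positive diagnosis when glyhb > 7.0; a missing glyhb stays missing
diagnosed <- ifelse(glyhb > 7, 1, 0)

# The waist/hip ratio has no unit, so inches and centimetres give the same value
ratio_in <- waist / hip
ratio_cm <- (waist * 2.54) / (hip * 2.54)

print(diagnosed)
print(round(ratio_in, 3))
print(isTRUE(all.equal(ratio_in, ratio_cm)))
```

Note that ifelse() propagates NA, which is one reason to deal with missing values before building the diagnosis variable.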
     
     

You may need the following functions:

                read.xls()       from package       gdata
                summary()       from package       base
                <-       from package       base
                [       from package       base
                get()       from package       base
                na.omit()       from package       stats
                attach()       from package       base
                detach()       from package       base
                ifelse()       from package       base
                levels()       from package       base
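
As a quick reminder of how some of these functions behave, here is a minimal sketch on a toy data frame. The values and the names toy, all_cols, some_cols and same are hypothetical and serve only this illustration.

```r
# Toy data frame (hypothetical values) with scattered missing values
toy <- data.frame(
  id     = 1:4,
  glyhb  = c(4.3, NA, 7.7, 5.0),
  gender = factor(c("female", "male", "male", "female")),
  bp.2s  = c(NA, NA, 150, NA)     # a column that is mostly missing
)

# na.omit() drops every row that has an NA in any column,
# so selecting columns with [ first keeps more rows
all_cols  <- na.omit(toy)
some_cols <- na.omit(toy[, c("id", "glyhb", "gender")])

cat("rows kept, all columns:     ", nrow(all_cols), "\n")
cat("rows kept, selected columns:", nrow(some_cols), "\n")

# get() fetches an object from its name given as a string
same <- get("some_cols")

# levels() renames factor levels in place (default level order is alphabetical)
levels(same$gender) <- c("F", "M")
print(same)
```

The order of operations matters: a single mostly-empty column is enough to make na.omit() discard most of the rows.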

Solution

 
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php?sol=1
 


     # load the (gH) helper functions; cats() and lit.dar() used below are not base R and presumably come from this file     
          
     source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r", encoding="latin1" )     
          
     # read the Excel data     
          
     library(gdata)     
     diabete <- read.xls("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.xls")     
          
     # at the university, gdata is not installed, so first run:     
     #   install.packages("gdata", lib="d:/")     
     # and then load it from that library:     
     #   library(gdata, lib.loc="d:/")     
          
     #############################################################################################     
     #     
     #   if gdata does not work, XLConnect can be used instead (but with a local file)     
     #     
     #      library(XLConnect)     
     #      wb       <- loadWorkbook("diabete.xls")     
     #      diabete  <- readWorksheet(wb,"diabetes")     
     #     
     #   as a last resort, the data in .DAR format can be used:     
     #     
     #      diabete <- lit.dar("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.dar")     
     #     
     #############################################################################################     
          
     # let's look at a summary of the data     
          
     cats("résumé des données")     
     print(summary(diabete))     
          
     # na.omit() applied to the whole data set     
     # is a bit too drastic: it drops every row that has an NA in any column     
          
     cats("si on supprime toute ligne avec NA")     
     d1 <- na.omit(diabete)     
     print(summary(d1))     
          
     # keep only the columns we are interested in     
          
     cats("avec juste les colonnes à garder")     
     d2 <- diabete[ , c("id","glyhb","age","gender","height","weight","waist","hip") ]     
     print(summary(d2))     
          
     # and now remove the rows with missing values     
          
     cats("si on supprime ensuite toute ligne avec NA")     
     d3 <- na.omit(d2)     
     print(summary(d3))     
          
     # along the way, here are several equivalent ways to assign     
          
     assign("diab",d3)     
     "<-"(diab,d3)     
     diab <- get("d3")     
     diab <- d3     
          
     cat("\n\nau final, on travaille sur des données de taille ",dim(diab),"\n")     
          
     # build the French-named variables     
          
     # convert height (inches) and weight (pounds) into taille (cm) and poids (kg)     
     # and convert waist, hip (inches) into ceinture and hanches (cm)     
          
     attach(diab)     
          
     taille   <- height * 2.54     
     poids    <- weight * 0.45359237     
     ceinture <- waist  * 2.54     
     hanches  <- hip    * 2.54     
          
     #     
     # to create the columns directly in the data frame instead:     
     #     
     #  diab <- transform(diab,taille=height * 2.54, poids=weight * 0.45359237, etc...     
     #     
          
     # compute the waist/hip ratio, named rach     
          
     rach     <- ceinture/hanches     
          
     # diabetes diagnosis derived from the glyhb variable     
     # coding: 0=no diabetes, 1=diabetes     
     # note: this reuses the name diabete, replacing the raw data frame read above     
          
     diabete  <- ifelse(glyhb<=7,0,1)     
          
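     # translate the gender variable into sexe     
          
     sexe         <- factor(gender)        # levels<- needs a factor; strings are no longer factors by default since R 4.0     
     levels(sexe) <- c("femme","homme")    # levels are in alphabetical order: female, male     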
          
     detach(diab)     
          
     # add these variables to the data frame     
          
     diabfr   <- cbind(diab, taille, poids, ceinture, hanches, rach, sexe, diabete )     
          
     # here are the first rows of the data     
          
     cats("les premières lignes de diabfr")     
     print(head(diabfr))     
          
          

          
     résumé des données     
     ==================     
          
           Pnum             id             chol          stab.glu          hdl             ratio            glyhb             location        age           gender        height     
      Min.   :  1.0   Min.   : 1000   Min.   : 78.0   Min.   : 48.0   Min.   : 12.00   Min.   : 1.500   Min.   : 2.68   Buckingham:200   Min.   :19.00   female:234   Min.   :52.00     
      1st Qu.:101.5   1st Qu.: 4792   1st Qu.:179.0   1st Qu.: 81.0   1st Qu.: 38.00   1st Qu.: 3.200   1st Qu.: 4.38   Louisa    :203   1st Qu.:34.00   male  :169   1st Qu.:63.00     
      Median :202.0   Median :15766   Median :204.0   Median : 89.0   Median : 46.00   Median : 4.200   Median : 4.84                    Median :45.00                Median :66.00     
      Mean   :202.0   Mean   :15978   Mean   :207.8   Mean   :106.7   Mean   : 50.45   Mean   : 4.522   Mean   : 5.59                    Mean   :46.85                Mean   :66.02     
      3rd Qu.:302.5   3rd Qu.:20336   3rd Qu.:230.0   3rd Qu.:106.0   3rd Qu.: 59.00   3rd Qu.: 5.400   3rd Qu.: 5.60                    3rd Qu.:60.00                3rd Qu.:69.00     
      Max.   :403.0   Max.   :41756   Max.   :443.0   Max.   :385.0   Max.   :120.00   Max.   :19.300   Max.   :16.11                    Max.   :92.00                Max.   :76.00     
                                      NA's   :1                       NA's   :1        NA's   :1        NA's   :13                                                    NA's   :5     
          weight         frame         bp.1s           bp.1d            bp.2s           bp.2d            waist           hip           time.ppn     
      Min.   : 99.0         : 12   Min.   : 90.0   Min.   : 48.00   Min.   :110.0   Min.   : 60.00   Min.   :26.0   Min.   :30.00   Min.   :   5.0     
      1st Qu.:151.0   large :103   1st Qu.:121.2   1st Qu.: 75.00   1st Qu.:138.0   1st Qu.: 84.00   1st Qu.:33.0   1st Qu.:39.00   1st Qu.:  90.0     
      Median :172.5   medium:184   Median :136.0   Median : 82.00   Median :149.0   Median : 92.00   Median :37.0   Median :42.00   Median : 240.0     
      Mean   :177.6   small :104   Mean   :136.9   Mean   : 83.32   Mean   :152.4   Mean   : 92.52   Mean   :37.9   Mean   :43.04   Mean   : 341.2     
      3rd Qu.:200.0                3rd Qu.:146.8   3rd Qu.: 90.00   3rd Qu.:161.0   3rd Qu.:100.00   3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.: 517.5     
      Max.   :325.0                Max.   :250.0   Max.   :124.00   Max.   :238.0   Max.   :124.00   Max.   :56.0   Max.   :64.00   Max.   :1560.0     
      NA's   :1                    NA's   :5       NA's   :5        NA's   :262     NA's   :262      NA's   :2      NA's   :2       NA's   :3     
          
     si on supprime toute ligne avec NA     
     ==================================     
          
           Pnum             id             chol          stab.glu          hdl             ratio            glyhb              location       age          gender       height     
      Min.   :  3.0   Min.   : 1002   Min.   :134.0   Min.   : 48.0   Min.   : 23.00   Min.   : 2.200   Min.   : 2.850   Buckingham:50   Min.   :20.0   female:76   Min.   :52.00     
      1st Qu.:107.0   1st Qu.: 4801   1st Qu.:184.8   1st Qu.: 85.0   1st Qu.: 37.00   1st Qu.: 3.300   1st Qu.: 4.397   Louisa    :86   1st Qu.:40.0   male  :60   1st Qu.:63.00     
      Median :208.0   Median :15790   Median :212.0   Median : 94.0   Median : 46.00   Median : 4.500   Median : 4.960                   Median :51.0               Median :66.00     
      Mean   :214.6   Mean   :18260   Mean   :218.1   Mean   :115.3   Mean   : 50.62   Mean   : 4.764   Mean   : 5.942                   Mean   :51.3               Mean   :65.59     
      3rd Qu.:339.5   3rd Qu.:21327   3rd Qu.:242.2   3rd Qu.:113.5   3rd Qu.: 59.25   3rd Qu.: 5.600   3rd Qu.: 6.482                   3rd Qu.:63.0               3rd Qu.:69.00     
      Max.   :400.0   Max.   :41507   Max.   :443.0   Max.   :369.0   Max.   :120.00   Max.   :19.300   Max.   :16.110                   Max.   :89.0               Max.   :74.00     
          weight         frame        bp.1s           bp.1d            bp.2s           bp.2d            waist            hip           time.ppn     
      Min.   :102.0         : 6   Min.   :100.0   Min.   : 60.00   Min.   :110.0   Min.   : 60.00   Min.   :28.00   Min.   :33.00   Min.   :  10.0     
      1st Qu.:156.0   large :36   1st Qu.:141.5   1st Qu.: 88.75   1st Qu.:138.8   1st Qu.: 84.00   1st Qu.:35.00   1st Qu.:40.00   1st Qu.:  87.5     
      Median :179.0   medium:72   Median :150.0   Median : 94.00   Median :150.0   Median : 92.00   Median :38.00   Median :43.00   Median : 240.0     
      Mean   :184.2   small :22   Mean   :155.6   Mean   : 94.13   Mean   :152.9   Mean   : 92.52   Mean   :39.12   Mean   :44.14   Mean   : 306.5     
      3rd Qu.:210.0               3rd Qu.:166.2   3rd Qu.:100.00   3rd Qu.:161.2   3rd Qu.:100.00   3rd Qu.:43.00   3rd Qu.:47.00   3rd Qu.: 397.5     
      Max.   :290.0               Max.   :230.0   Max.   :124.00   Max.   :238.0   Max.   :124.00   Max.   :55.00   Max.   :64.00   Max.   :1440.0     
          
     avec juste les colonnes à garder     
     ================================     
          
            id            glyhb            age           gender        height          weight          waist           hip     
      Min.   : 1000   Min.   : 2.68   Min.   :19.00   female:234   Min.   :52.00   Min.   : 99.0   Min.   :26.0   Min.   :30.00     
      1st Qu.: 4792   1st Qu.: 4.38   1st Qu.:34.00   male  :169   1st Qu.:63.00   1st Qu.:151.0   1st Qu.:33.0   1st Qu.:39.00     
      Median :15766   Median : 4.84   Median :45.00                Median :66.00   Median :172.5   Median :37.0   Median :42.00     
      Mean   :15978   Mean   : 5.59   Mean   :46.85                Mean   :66.02   Mean   :177.6   Mean   :37.9   Mean   :43.04     
      3rd Qu.:20336   3rd Qu.: 5.60   3rd Qu.:60.00                3rd Qu.:69.00   3rd Qu.:200.0   3rd Qu.:41.0   3rd Qu.:46.00     
      Max.   :41756   Max.   :16.11   Max.   :92.00                Max.   :76.00   Max.   :325.0   Max.   :56.0   Max.   :64.00     
                      NA's   :13                                   NA's   :5       NA's   :1       NA's   :2      NA's   :2     
          
     si on supprime ensuite toute ligne avec NA     
     ==========================================     
          
            id            glyhb             age           gender        height          weight          waist            hip     
      Min.   : 1000   Min.   : 2.680   Min.   :19.00   female:222   Min.   :52.00   Min.   : 99.0   Min.   :26.00   Min.   :30.00     
      1st Qu.: 4792   1st Qu.: 4.390   1st Qu.:34.00   male  :160   1st Qu.:63.00   1st Qu.:151.0   1st Qu.:33.00   1st Qu.:39.00     
      Median :15770   Median : 4.840   Median :45.00                Median :66.00   Median :173.5   Median :37.00   Median :42.00     
      Mean   :15979   Mean   : 5.577   Mean   :46.85                Mean   :65.99   Mean   :177.7   Mean   :37.92   Mean   :43.04     
      3rd Qu.:20331   3rd Qu.: 5.600   3rd Qu.:60.00                3rd Qu.:69.00   3rd Qu.:200.0   3rd Qu.:41.00   3rd Qu.:46.00     
      Max.   :41752   Max.   :16.110   Max.   :92.00                Max.   :76.00   Max.   :325.0   Max.   :56.00   Max.   :64.00     
          
          
     au final, on travaille sur des données de taille  382 8     
          
     les premières lignes de diabfr     
     ==============================     
          
         id glyhb age gender height weight waist hip taille     poids ceinture hanches      rach  sexe diabete     
     1 1000  4.31  46 female     62    121    29  38 157.48  54.88468    73.66   96.52 0.7631579 femme       0     
     2 1001  4.44  29 female     64    218    46  48 162.56  98.88314   116.84  121.92 0.9583333 femme       0     
     3 1002  4.64  58 female     61    256    49  57 154.94 116.11965   124.46  144.78 0.8596491 femme       0     
     4 1003  4.63  67   male     67    119    33  38 170.18  53.97749    83.82   96.52 0.8684211 homme       0     
     5 1005  7.72  64   male     68    183    44  41 172.72  83.00740   111.76  104.14 1.0731707 homme       1     
     6 1008  4.81  34   male     71    190    36  42 180.34  86.18255    91.44  106.68 0.8571429 homme       0     
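
The same construction can also be sketched without attach()/detach(), along the lines of the transform() hint in the comments of the solution. This is only one possible variant, not the course's reference solution: the six rows are copied from the listing above so that the block runs on its own, and the names diabfr2, inch_to_cm and pound_to_kg are introduced here for illustration.

```r
# First six rows of the cleaned data, copied from the listing above
diab <- data.frame(
  id     = c(1000, 1001, 1002, 1003, 1005, 1008),
  glyhb  = c(4.31, 4.44, 4.64, 4.63, 7.72, 4.81),
  age    = c(46, 29, 58, 67, 64, 34),
  gender = c("female", "female", "female", "male", "male", "male"),
  height = c(62, 64, 61, 67, 68, 71),
  weight = c(121, 218, 256, 119, 183, 190),
  waist  = c(29, 46, 49, 33, 44, 36),
  hip    = c(38, 48, 57, 38, 41, 42)
)

# Conversion factors used in the solution
inch_to_cm  <- 2.54
pound_to_kg <- 0.45359237

# transform() evaluates each expression inside the data frame,
# so no attach()/detach() is needed and no loose variables are created
diabfr2 <- transform(
  diab,
  taille   = height * inch_to_cm,
  poids    = weight * pound_to_kg,
  ceinture = waist  * inch_to_cm,
  hanches  = hip    * inch_to_cm,
  rach     = waist / hip,      # unit-free, so the original columns can be used
  sexe     = factor(gender, levels = c("female", "male"),
                    labels = c("femme", "homme")),
  diabete  = ifelse(glyhb <= 7, 0, 1)
)

print(diabfr2)
```

Passing explicit levels and labels to factor() makes the female/femme and male/homme mapping independent of whether gender was read as a factor or as plain strings.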
          

 

 
