
Review exercise for session 1

 
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php
 

          


Try to read the DIABETE data into a data.frame keeping only the columns id, glyhb, age, gender, height, weight, waist and hip, with the measurements then converted to metric units. This is the diabetes data set taken from the datasets that Professor F. Harrell provides with his book Regression Modeling Strategies. Beware of the many missing values.

Next, build a "diagnosed diabetes" variable after reading the abstract pubmed/9258308 (excerpt boxed below). For this exercise you can therefore ignore the variables chol, stab.glu, hdl, ratio, location, frame, bp.1s, bp.1d and time.ppn, then add a waist/hip ratio variable named rach.

Note: if you cannot read the Excel file, you can fall back on the file diabete.dar, which has the same missing values, marked as NA.
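
As a rough illustration of how NA markers are picked up when reading plain text, here is a minimal sketch. The whitespace-delimited layout with a header line is an assumption made for the example (the actual layout of diabete.dar is not described here), and the three rows are invented.

```r
# Invented sample in an ASSUMED layout: header line, whitespace-delimited,
# missing values written as NA (this is not the real diabete.dar content)
txt <- "id glyhb age gender
1 4.31 46 female
2 NA 29 female
3 7.72 64 male"

# na.strings tells the reader which token stands for a missing value;
# stringsAsFactors = TRUE makes gender a factor, as summary() expects
toy <- read.table(text = txt, header = TRUE, na.strings = "NA",
                  stringsAsFactors = TRUE)

print(summary(toy))            # the glyhb column reports one NA
print(sum(is.na(toy$glyhb)))   # number of missing glyhb values
```

With a real file, the text argument would be replaced by the file name or URL.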

     Prevalence of coronary heart disease risk factors among rural blacks: a community-based study
     (diabetes Dataset).
     
     These data are courtesy of Dr John Schorling, Department of Medicine,
     University of Virginia School of Medicine.
     
     The data consist of 19 variables on 403 subjects from 1046 subjects who
     were interviewed in a study to understand the prevalence of obesity, diabetes,
     and other cardiovascular risk factors in central Virginia for African Americans.
     According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is
     associated most strongly with obesity. The waist/hip ratio may be a predictor
     in diabetes and heart disease. DM II is also associated with hypertension
     - they may both be part of "Syndrome X". The 403 subjects were the ones who
     were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually
     taken as a positive diagnosis of diabetes.
     
     Background. Coronary heart disease (CHD) remains the most common cause of
     death among blacks, and the difference in CHD mortality between blacks
     and whites is growing. This trend may be due in part to higher rates of
     CHD risk factors among blacks. This study was done to determine the
     prevalence of CHD risk factors among a population-based sample of 403
     rural blacks in Virginia.
     
     Methods. Community-based screening evaluations included the
     determination of exercise and smoking habits, blood pressure,
     height, weight, total and high-density lipoprotein (HDL)
     cholesterol, and glycosylated hemoglobin.
     
     (C) 1997 Southern Medical Association
     
     

You may need the following functions:

                read.xls()       from package       gdata
                summary()       from package       base
                <-       from package       base
                [       from package       base
                get()       from package       base
                na.omit()       from package       stats
                attach()       from package       base
                detach()       from package       base
                ifelse()       from package       base
                levels()       from package       base
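
To give a feel for how some of the functions listed above combine, here is a small sketch on a toy data frame. All values and the data frame name toy are invented for the illustration and are not taken from the DIABETE data.

```r
# Toy data frame (invented values, not the real DIABETE data)
toy <- data.frame(
  id    = 1:5,
  glyhb = c(4.3, NA, 7.7, 5.1, 6.9),
  age   = c(46, 29, 64, 34, 58),
  bp.2s = c(NA, NA, 150, NA, NA)   # a mostly missing column we do not need
)

# Dropping incomplete rows on all columns loses almost everything
all_cols <- na.omit(toy)

# Selecting the columns of interest with [ first keeps more rows
kept <- na.omit(toy[ , c("id", "glyhb", "age")])

# ifelse() turns a numeric threshold into a 0/1 code
kept$diag <- ifelse(kept$glyhb > 7, 1, 0)

print(nrow(all_cols))   # rows left when every column must be complete
print(nrow(kept))       # rows left when only three columns must be complete
print(kept)
```

The point of the sketch is the order of operations: choosing columns before calling na.omit() avoids discarding rows because of variables that are not used afterwards.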

Solution

 
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php?sol=1
 

     # load the (gH) helper functions
          
     source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r", encoding="latin1" )     
          
     # read the Excel data
          
     library(gdata)     
     diabete <- read.xls("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.xls")     
          
     # at the university, gdata is not installed, so you first have to run:
     #   install.packages("gdata", lib="d:/")
     # before you can execute
     #   library(gdata, lib.loc="d:/")
          
     #############################################################################################
     #
     #   if gdata does not work, you can use XLConnect (but with a local file)
     #
     #      library(XLConnect)
     #      wb       <- loadWorkbook("diabete.xls")
     #      diabete  <- readWorksheet(wb,"diabetes")
     #
     #   as a last resort, you can use the data in .DAR format:
     #
     #      diabete <- lit.dar("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.dar")
     #
     #############################################################################################
          
     # let's look at a summary of the data

     cats("résumé des données")                           # heading: "summary of the data"
     print(summary(diabete))

     # na.omit() applied to the whole data set is too drastic:
     # only 136 of the 403 rows survive (76 female + 60 male)

     cats("si on supprime toute ligne avec NA")           # "if we drop every row with an NA"
     d1 <- na.omit(diabete)
     print(summary(d1))

     # let's keep only the columns we are interested in

     cats("avec juste les colonnes à garder")             # "with only the columns to keep"
     d2 <- diabete[ , c("id","glyhb","age","gender","height","weight","waist","hip") ]
     print(summary(d2))

     # and now we remove the missing data

     cats("si on supprime ensuite toute ligne avec NA")   # "if we then drop every row with an NA"
     d3 <- na.omit(d2)
     print(summary(d3))
          
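     # along the way, let's learn the various ways to assign
     # (the four statements below are equivalent)

     assign("diab",d3)
     "<-"(diab,d3)
     diab <- get("d3")
     diab <- d3

     # print the final dimensions ("in the end, we work on data of size ...")
     cat("\n\nau final, on travaille sur des données de taille ",dim(diab),"\n")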
          
     # let's build the French-named variables

     # conversion of height (inches) and weight (pounds) into taille (cm) and poids (kg)
     # and conversion of waist, hip (inches) into ceinture and hanches (cm)

     attach(diab)

     taille   <- height * 2.54
     poids    <- weight * 0.45359237
     ceinture <- waist  * 2.54
     hanches  <- hip    * 2.54

     #
     # if you want to create the columns in the data frame instead:
     #
     #  diab <- transform(diab,taille=height * 2.54, poids=weight * 0.45359237, etc...
     #

     # compute the waist/hip ratio, named rach

     rach     <- ceinture/hanches

     # diabetes diagnosis from the variable glyhb
     # coding: 0=without diabetes, 1=with diabetes
     # (note: this reuses the name diabete, replacing the raw data frame read above)

     diabete  <- ifelse(glyhb<=7,0,1)
          
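     # translate the variable gender into sexe
     # (factor() is harmless if gender is already a factor; since R 4.0
     #  text columns are no longer read as factors by default)

     sexe         <- factor(gender)
     levels(sexe) <- c("femme","homme")   # levels are sorted: female, male

     detach(diab)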
          
     # add these variables to the data frame

     diabfr   <- cbind(diab, taille, poids, ceinture, hanches, rach, sexe, diabete )

     # here is the beginning of the data

     cats("les premières lignes de diabfr")               # "the first rows of diabfr"
     print(head(diabfr))
          
          
          
     résumé des données
     ==================     
          
           Pnum             id             chol          stab.glu          hdl             ratio            glyhb             location        age           gender        height     
      Min.   :  1.0   Min.   : 1000   Min.   : 78.0   Min.   : 48.0   Min.   : 12.00   Min.   : 1.500   Min.   : 2.68   Buckingham:200   Min.   :19.00   female:234   Min.   :52.00     
      1st Qu.:101.5   1st Qu.: 4792   1st Qu.:179.0   1st Qu.: 81.0   1st Qu.: 38.00   1st Qu.: 3.200   1st Qu.: 4.38   Louisa    :203   1st Qu.:34.00   male  :169   1st Qu.:63.00     
      Median :202.0   Median :15766   Median :204.0   Median : 89.0   Median : 46.00   Median : 4.200   Median : 4.84                    Median :45.00                Median :66.00     
      Mean   :202.0   Mean   :15978   Mean   :207.8   Mean   :106.7   Mean   : 50.45   Mean   : 4.522   Mean   : 5.59                    Mean   :46.85                Mean   :66.02     
      3rd Qu.:302.5   3rd Qu.:20336   3rd Qu.:230.0   3rd Qu.:106.0   3rd Qu.: 59.00   3rd Qu.: 5.400   3rd Qu.: 5.60                    3rd Qu.:60.00                3rd Qu.:69.00     
      Max.   :403.0   Max.   :41756   Max.   :443.0   Max.   :385.0   Max.   :120.00   Max.   :19.300   Max.   :16.11                    Max.   :92.00                Max.   :76.00     
                                      NA's   :1                       NA's   :1        NA's   :1        NA's   :13                                                    NA's   :5     
          weight         frame         bp.1s           bp.1d            bp.2s           bp.2d            waist           hip           time.ppn     
      Min.   : 99.0         : 12   Min.   : 90.0   Min.   : 48.00   Min.   :110.0   Min.   : 60.00   Min.   :26.0   Min.   :30.00   Min.   :   5.0     
      1st Qu.:151.0   large :103   1st Qu.:121.2   1st Qu.: 75.00   1st Qu.:138.0   1st Qu.: 84.00   1st Qu.:33.0   1st Qu.:39.00   1st Qu.:  90.0     
      Median :172.5   medium:184   Median :136.0   Median : 82.00   Median :149.0   Median : 92.00   Median :37.0   Median :42.00   Median : 240.0     
      Mean   :177.6   small :104   Mean   :136.9   Mean   : 83.32   Mean   :152.4   Mean   : 92.52   Mean   :37.9   Mean   :43.04   Mean   : 341.2     
      3rd Qu.:200.0                3rd Qu.:146.8   3rd Qu.: 90.00   3rd Qu.:161.0   3rd Qu.:100.00   3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.: 517.5     
      Max.   :325.0                Max.   :250.0   Max.   :124.00   Max.   :238.0   Max.   :124.00   Max.   :56.0   Max.   :64.00   Max.   :1560.0     
      NA's   :1                    NA's   :5       NA's   :5        NA's   :262     NA's   :262      NA's   :2      NA's   :2       NA's   :3     
          
     si on supprime toute ligne avec NA     
     ==================================     
          
           Pnum             id             chol          stab.glu          hdl             ratio            glyhb              location       age          gender       height     
      Min.   :  3.0   Min.   : 1002   Min.   :134.0   Min.   : 48.0   Min.   : 23.00   Min.   : 2.200   Min.   : 2.850   Buckingham:50   Min.   :20.0   female:76   Min.   :52.00     
      1st Qu.:107.0   1st Qu.: 4801   1st Qu.:184.8   1st Qu.: 85.0   1st Qu.: 37.00   1st Qu.: 3.300   1st Qu.: 4.397   Louisa    :86   1st Qu.:40.0   male  :60   1st Qu.:63.00     
      Median :208.0   Median :15790   Median :212.0   Median : 94.0   Median : 46.00   Median : 4.500   Median : 4.960                   Median :51.0               Median :66.00     
      Mean   :214.6   Mean   :18260   Mean   :218.1   Mean   :115.3   Mean   : 50.62   Mean   : 4.764   Mean   : 5.942                   Mean   :51.3               Mean   :65.59     
      3rd Qu.:339.5   3rd Qu.:21327   3rd Qu.:242.2   3rd Qu.:113.5   3rd Qu.: 59.25   3rd Qu.: 5.600   3rd Qu.: 6.482                   3rd Qu.:63.0               3rd Qu.:69.00     
      Max.   :400.0   Max.   :41507   Max.   :443.0   Max.   :369.0   Max.   :120.00   Max.   :19.300   Max.   :16.110                   Max.   :89.0               Max.   :74.00     
          weight         frame        bp.1s           bp.1d            bp.2s           bp.2d            waist            hip           time.ppn     
      Min.   :102.0         : 6   Min.   :100.0   Min.   : 60.00   Min.   :110.0   Min.   : 60.00   Min.   :28.00   Min.   :33.00   Min.   :  10.0     
      1st Qu.:156.0   large :36   1st Qu.:141.5   1st Qu.: 88.75   1st Qu.:138.8   1st Qu.: 84.00   1st Qu.:35.00   1st Qu.:40.00   1st Qu.:  87.5     
      Median :179.0   medium:72   Median :150.0   Median : 94.00   Median :150.0   Median : 92.00   Median :38.00   Median :43.00   Median : 240.0     
      Mean   :184.2   small :22   Mean   :155.6   Mean   : 94.13   Mean   :152.9   Mean   : 92.52   Mean   :39.12   Mean   :44.14   Mean   : 306.5     
      3rd Qu.:210.0               3rd Qu.:166.2   3rd Qu.:100.00   3rd Qu.:161.2   3rd Qu.:100.00   3rd Qu.:43.00   3rd Qu.:47.00   3rd Qu.: 397.5     
      Max.   :290.0               Max.   :230.0   Max.   :124.00   Max.   :238.0   Max.   :124.00   Max.   :55.00   Max.   :64.00   Max.   :1440.0     
          
     avec juste les colonnes à garder
     ================================     
          
            id            glyhb            age           gender        height          weight          waist           hip     
      Min.   : 1000   Min.   : 2.68   Min.   :19.00   female:234   Min.   :52.00   Min.   : 99.0   Min.   :26.0   Min.   :30.00     
      1st Qu.: 4792   1st Qu.: 4.38   1st Qu.:34.00   male  :169   1st Qu.:63.00   1st Qu.:151.0   1st Qu.:33.0   1st Qu.:39.00     
      Median :15766   Median : 4.84   Median :45.00                Median :66.00   Median :172.5   Median :37.0   Median :42.00     
      Mean   :15978   Mean   : 5.59   Mean   :46.85                Mean   :66.02   Mean   :177.6   Mean   :37.9   Mean   :43.04     
      3rd Qu.:20336   3rd Qu.: 5.60   3rd Qu.:60.00                3rd Qu.:69.00   3rd Qu.:200.0   3rd Qu.:41.0   3rd Qu.:46.00     
      Max.   :41756   Max.   :16.11   Max.   :92.00                Max.   :76.00   Max.   :325.0   Max.   :56.0   Max.   :64.00     
                      NA's   :13                                   NA's   :5       NA's   :1       NA's   :2      NA's   :2     
          
     si on supprime ensuite toute ligne avec NA     
     ==========================================     
          
            id            glyhb             age           gender        height          weight          waist            hip     
      Min.   : 1000   Min.   : 2.680   Min.   :19.00   female:222   Min.   :52.00   Min.   : 99.0   Min.   :26.00   Min.   :30.00     
      1st Qu.: 4792   1st Qu.: 4.390   1st Qu.:34.00   male  :160   1st Qu.:63.00   1st Qu.:151.0   1st Qu.:33.00   1st Qu.:39.00     
      Median :15770   Median : 4.840   Median :45.00                Median :66.00   Median :173.5   Median :37.00   Median :42.00     
      Mean   :15979   Mean   : 5.577   Mean   :46.85                Mean   :65.99   Mean   :177.7   Mean   :37.92   Mean   :43.04     
      3rd Qu.:20331   3rd Qu.: 5.600   3rd Qu.:60.00                3rd Qu.:69.00   3rd Qu.:200.0   3rd Qu.:41.00   3rd Qu.:46.00     
      Max.   :41752   Max.   :16.110   Max.   :92.00                Max.   :76.00   Max.   :325.0   Max.   :56.00   Max.   :64.00     
          
          
     au final, on travaille sur des données de taille  382 8
          
     les premières lignes de diabfr
     ==============================     
          
         id glyhb age gender height weight waist hip taille     poids ceinture hanches      rach  sexe diabete     
     1 1000  4.31  46 female     62    121    29  38 157.48  54.88468    73.66   96.52 0.7631579 femme       0     
     2 1001  4.44  29 female     64    218    46  48 162.56  98.88314   116.84  121.92 0.9583333 femme       0     
     3 1002  4.64  58 female     61    256    49  57 154.94 116.11965   124.46  144.78 0.8596491 femme       0     
     4 1003  4.63  67   male     67    119    33  38 170.18  53.97749    83.82   96.52 0.8684211 homme       0     
     5 1005  7.72  64   male     68    183    44  41 172.72  83.00740   111.76  104.14 1.0731707 homme       1     
     6 1008  4.81  34   male     71    190    36  42 180.34  86.18255    91.44  106.68 0.8571429 homme       0     
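
The transform() variant mentioned in a comment of the solution can be sketched in full as follows, which avoids attach() and detach(). This is only one possible way to write it; the data frame names demo and demo_fr are made up for the illustration, and demo holds rows 1, 4 and 5 of the listing above, typed in by hand.

```r
# Rows 1, 4 and 5 of the listing above, typed in by hand
demo <- data.frame(
  id     = c(1000, 1003, 1005),
  glyhb  = c(4.31, 4.63, 7.72),
  gender = factor(c("female", "male", "male"), levels = c("female", "male")),
  height = c(62, 67, 68),       # inches
  weight = c(121, 119, 183),    # pounds
  waist  = c(29, 33, 44),       # inches
  hip    = c(38, 38, 41)        # inches
)

# transform() evaluates each expression in the ORIGINAL columns, so rach is
# written as waist / hip (the cm conversion factor cancels out anyway)
demo_fr <- transform(
  demo,
  taille   = height * 2.54,
  poids    = weight * 0.45359237,
  ceinture = waist  * 2.54,
  hanches  = hip    * 2.54,
  rach     = waist / hip,
  diabete  = ifelse(glyhb > 7, 1, 0)
)

print(demo_fr)
```

The resulting values can be compared by eye with the taille, poids, rach and diabete columns of the same rows in the listing.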
          

 

 
