Review exercise for session 1
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php
Try to read the DIABETE data into a data.frame whose only columns are id, glyhb, age, gender, height, weight, waist and hip, then convert the measurements to metric units. This is the diabetes dataset taken from the datasets that Professor F. Harrell provides for his book Regression Modeling Strategies. Beware of the many missing values.
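Why the order of operations matters with missing values can be seen on a small made-up table. The sketch below does not read the real DIABETE file: the data frame toy, its values and the column names height_cm and weight_kg are assumptions chosen for illustration only.

```r
# Toy data frame with made-up values (NOT the real DIABETE file).
toy <- data.frame(
  id     = c(1, 2, 3, 4),
  chol   = c(200, NA, 180, 210),   # a column we do not want to keep
  height = c(62, 64, NA, 67),      # inches
  weight = c(121, 218, 256, 119)   # pounds
)

# Dropping incomplete rows on the whole table loses rows 2 and 3 ...
print(nrow(na.omit(toy)))

# ... whereas selecting the columns first only loses row 3,
# because the NA in chol no longer matters.
keep      <- c("id", "height", "weight")
toy_small <- na.omit(toy[, keep])
print(nrow(toy_small))

# Metric conversions: 1 inch = 2.54 cm, 1 pound = 0.45359237 kg.
toy_small$height_cm <- toy_small$height * 2.54
toy_small$weight_kg <- toy_small$weight * 0.45359237
print(toy_small)
```

Selecting the columns before calling na.omit() keeps every row whose only missing values sit in columns that are discarded anyway.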
Next, build a "diagnosed diabetes" variable after reading the abstract pubmed/9258308 (excerpt boxed below). For this exercise you can therefore ignore the variables chol, stab.glu, hdl, ratio, location, frame, bp.1s, bp.1d and time.ppn, then add a waist/hip ratio variable named rach.
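As a rough illustration of these two derived variables, the sketch below applies the threshold quoted in the excerpt that follows (glycosylated hemoglobin > 7.0) to a few hand-typed values. The vector names diagnosed and diagnosed_f and the labels are hypothetical; only rach is a name imposed by the exercise.

```r
# Hand-typed glyhb values, including a borderline case and a missing value.
glyhb <- c(4.31, 7.72, 7.00, NA)

# 1 = diabetes diagnosed (glyhb strictly above 7), 0 = not diagnosed.
# A missing glyhb yields a missing diagnosis.
diagnosed <- ifelse(glyhb > 7, 1, 0)
print(diagnosed)

# Optional: a labelled factor is easier to read in tables.
diagnosed_f <- factor(diagnosed, levels = c(0, 1),
                      labels = c("no diabetes", "diabetes"))
print(table(diagnosed_f, useNA = "ifany"))

# Waist/hip ratio: the unit cancels out, so inches or cm give the same value.
waist <- c(29, 44)
hip   <- c(38, 41)
rach  <- waist / hip
print(round(rach, 4))
```

A value of exactly 7.0 is coded 0 here, because the excerpt speaks of values greater than 7.0.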
Note: if you cannot read the Excel file, you can fall back on the file diabete.dar, which has the same missing values, marked as NA.
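The layout of diabete.dar is not described on this page. Purely as an assumption, the sketch below treats it as whitespace-separated text with a header line, and it reads an inline string with base read.table() rather than the real file. The values in txt are made up.

```r
# Inline stand-in for a text file whose missing values are written NA
# (hypothetical content, assumed whitespace-separated with a header).
txt <- "id glyhb age gender
1000 4.31 46 female
1001 NA 29 female
1002 4.64 NA male"

d <- read.table(text = txt, header = TRUE,
                na.strings = "NA",          # strings to treat as missing
                stringsAsFactors = TRUE)    # gender becomes a factor

# Number of missing values per column, then the complete rows only.
print(colSums(is.na(d)))
complete <- na.omit(d)
print(complete)
```

With a real file, the text = txt argument would be replaced by the file name or URL, provided the file really has this layout.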
Prevalence of coronary heart disease risk factors among rural blacks: a community-based study (diabetes Dataset).

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data consist of 19 variables on 403 subjects from 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor in diabetes and heart disease. DM II is also associated with hypertension - they may both be part of "Syndrome X". The 403 subjects were the ones who were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes.

Background. Coronary heart disease (CHD) remains the most common cause of death among blacks, and the difference in CHD mortality between blacks and whites is growing. This trend may be due in part to higher rates of CHD risk factors among blacks. This study was done to determine the prevalence of CHD risk factors among a population-based sample of 403 rural blacks in Virginia.

Methods. Community-based screening evaluations included the determination of exercise and smoking habits, blood pressure, height, weight, total and high-density lipoprotein (HDL) cholesterol, and glycosylated hemoglobin.

(C) 1997 Southern Medical Association

You may need to use the following functions:
read.xls() from the gdata package
summary() from the base package
<- from the base package
[ from the base package
get() from the base package
na.omit() from the stats package
attach() from the base package
detach() from the base package
ifelse() from the base package
levels() from the base package

Solution
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php?sol=1
# load the (gH) functions
source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r", encoding="latin1" )

# read the Excel data
library(gdata)
diabete <- read.xls("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.xls")

# on the university machines gdata is not installed, so first run:
#    install.packages("gdata",lib="d:/")
# before you can execute
#    library(gdata,lib.loc="d:/")

#############################################################################################
#
# if gdata does not work, XLConnect can be used instead (but with a local file)
#
#    library(XLConnect)
#    wb      <- loadWorkbook("diabete.xls")
#    diabete <- readWorksheet(wb,"diabetes")
#
# as a last resort, the data can be read from the .DAR file:
#
#    diabete <- lit.dar("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.dar")
#
#############################################################################################

# let's look at a summary of the data
cats("data summary")
print(summary(diabete))

# na.omit() applied to the whole data set
# is a bit too drastic
cats("if we drop every row with an NA")
d1 <- na.omit(diabete)
print(summary(d1))

# keep only the columns we are interested in
cats("with only the columns to keep")
d2 <- diabete[ , c("id","glyhb","age","gender","height","weight","waist","hip") ]
print(summary(d2))

# and now remove the missing data
cats("if we then drop every row with an NA")
d3 <- na.omit(d2)
print(summary(d3))

# along the way, let's learn the various ways to assign
assign("diab",d3)
"<-"(diab,d3)
diab <- get("d3")
diab <- d3

cat("\n\nin the end, we are working with data of size ",dim(diab),"\n")

# build the French-named variables
# convert height (inches) and weight (pounds) to taille (cm) and poids (kg)
# and convert waist, hip (inches) to ceinture and hanches (cm)
attach(diab)
taille   <- height * 2.54
poids    <- weight * 0.45359237
ceinture <- waist  * 2.54
hanches  <- hip    * 2.54

#
# to create the columns directly in the data frame instead:
#
#    diab <- transform(diab,taille=height * 2.54, poids=weight * 0.45359237, etc...
#

# compute the waist/hip ratio, named rach
rach <- ceinture/hanches

# diabetes diagnosis from the glyhb variable
# coding: 0=no diabetes, 1=diabetes
diabete <- ifelse(glyhb<=7,0,1)

# translate the gender variable into sexe
# (this relies on gender being a factor with levels female, male in that order)
sexe <- gender
levels(sexe) <- c("femme","homme")
detach(diab)

# add these variables to the data frame
diabfr <- cbind(diab, taille, poids, ceinture, hanches, rach, sexe, diabete )

# here is the beginning of the data
cats("the first rows of diabfr")
print(head(diabfr))

data summary
============

Pnum           id             chol           stab.glu       hdl             ratio           glyhb          location        age            gender      height
Min.   :  1.0  Min.   : 1000  Min.   : 78.0  Min.   : 48.0  Min.   : 12.00  Min.   : 1.500  Min.   : 2.68  Buckingham:200  Min.   :19.00  female:234  Min.   :52.00
1st Qu.:101.5  1st Qu.: 4792  1st Qu.:179.0  1st Qu.: 81.0  1st Qu.: 38.00  1st Qu.: 3.200  1st Qu.: 4.38  Louisa    :203  1st Qu.:34.00  male  :169  1st Qu.:63.00
Median :202.0  Median :15766  Median :204.0  Median : 89.0  Median : 46.00  Median : 4.200  Median : 4.84                  Median :45.00              Median :66.00
Mean   :202.0  Mean   :15978  Mean   :207.8  Mean   :106.7  Mean   : 50.45  Mean   : 4.522  Mean   : 5.59                  Mean   :46.85              Mean   :66.02
3rd Qu.:302.5  3rd Qu.:20336  3rd Qu.:230.0  3rd Qu.:106.0  3rd Qu.: 59.00  3rd Qu.: 5.400  3rd Qu.: 5.60                  3rd Qu.:60.00              3rd Qu.:69.00
Max.   :403.0  Max.   :41756  Max.   :443.0  Max.   :385.0  Max.   :120.00  Max.   :19.300  Max.   :16.11                  Max.   :92.00              Max.   :76.00
                              NA's   :1                     NA's   :1       NA's   :1       NA's   :13                                                NA's   :5

weight         frame       bp.1s          bp.1d           bp.2s          bp.2d           waist         hip            time.ppn
Min.   : 99.0        : 12  Min.   : 90.0  Min.   : 48.00  Min.   :110.0  Min.   : 60.00  Min.   :26.0  Min.   :30.00  Min.   :   5.0
1st Qu.:151.0  large :103  1st Qu.:121.2  1st Qu.: 75.00  1st Qu.:138.0  1st Qu.: 84.00  1st Qu.:33.0  1st Qu.:39.00  1st Qu.:  90.0
Median :172.5  medium:184  Median :136.0  Median : 82.00  Median :149.0  Median : 92.00  Median :37.0  Median :42.00  Median : 240.0
Mean   :177.6  small :104  Mean   :136.9  Mean   : 83.32  Mean   :152.4  Mean   : 92.52  Mean   :37.9  Mean   :43.04  Mean   : 341.2
3rd Qu.:200.0              3rd Qu.:146.8  3rd Qu.: 90.00  3rd Qu.:161.0  3rd Qu.:100.00  3rd Qu.:41.0  3rd Qu.:46.00  3rd Qu.: 517.5
Max.   :325.0              Max.   :250.0  Max.   :124.00  Max.   :238.0  Max.   :124.00  Max.   :56.0  Max.   :64.00  Max.   :1560.0
NA's   :1                  NA's   :5      NA's   :5       NA's   :262    NA's   :262     NA's   :2     NA's   :2      NA's   :3

if we drop every row with an NA
===============================

Pnum           id             chol           stab.glu       hdl             ratio           glyhb           location       age           gender     height
Min.   :  3.0  Min.   : 1002  Min.   :134.0  Min.   : 48.0  Min.   : 23.00  Min.   : 2.200  Min.   : 2.850  Buckingham:50  Min.   :20.0  female:76  Min.   :52.00
1st Qu.:107.0  1st Qu.: 4801  1st Qu.:184.8  1st Qu.: 85.0  1st Qu.: 37.00  1st Qu.: 3.300  1st Qu.: 4.397  Louisa    :86  1st Qu.:40.0  male  :60  1st Qu.:63.00
Median :208.0  Median :15790  Median :212.0  Median : 94.0  Median : 46.00  Median : 4.500  Median : 4.960                 Median :51.0             Median :66.00
Mean   :214.6  Mean   :18260  Mean   :218.1  Mean   :115.3  Mean   : 50.62  Mean   : 4.764  Mean   : 5.942                 Mean   :51.3             Mean   :65.59
3rd Qu.:339.5  3rd Qu.:21327  3rd Qu.:242.2  3rd Qu.:113.5  3rd Qu.: 59.25  3rd Qu.: 5.600  3rd Qu.: 6.482                 3rd Qu.:63.0             3rd Qu.:69.00
Max.   :400.0  Max.   :41507  Max.   :443.0  Max.   :369.0  Max.   :120.00  Max.   :19.300  Max.   :16.110                 Max.   :89.0             Max.   :74.00

weight         frame      bp.1s          bp.1d           bp.2s          bp.2d           waist          hip            time.ppn
Min.   :102.0        : 6  Min.   :100.0  Min.   : 60.00  Min.   :110.0  Min.   : 60.00  Min.   :28.00  Min.   :33.00  Min.   :  10.0
1st Qu.:156.0  large :36  1st Qu.:141.5  1st Qu.: 88.75  1st Qu.:138.8  1st Qu.: 84.00  1st Qu.:35.00  1st Qu.:40.00  1st Qu.:  87.5
Median :179.0  medium:72  Median :150.0  Median : 94.00  Median :150.0  Median : 92.00  Median :38.00  Median :43.00  Median : 240.0
Mean   :184.2  small :22  Mean   :155.6  Mean   : 94.13  Mean   :152.9  Mean   : 92.52  Mean   :39.12  Mean   :44.14  Mean   : 306.5
3rd Qu.:210.0             3rd Qu.:166.2  3rd Qu.:100.00  3rd Qu.:161.2  3rd Qu.:100.00  3rd Qu.:43.00  3rd Qu.:47.00  3rd Qu.: 397.5
Max.   :290.0             Max.   :230.0  Max.   :124.00  Max.   :238.0  Max.   :124.00  Max.   :55.00  Max.   :64.00  Max.   :1440.0

with only the columns to keep
=============================

id             glyhb          age            gender      height         weight         waist         hip
Min.   : 1000  Min.   : 2.68  Min.   :19.00  female:234  Min.   :52.00  Min.   : 99.0  Min.   :26.0  Min.   :30.00
1st Qu.: 4792  1st Qu.: 4.38  1st Qu.:34.00  male  :169  1st Qu.:63.00  1st Qu.:151.0  1st Qu.:33.0  1st Qu.:39.00
Median :15766  Median : 4.84  Median :45.00              Median :66.00  Median :172.5  Median :37.0  Median :42.00
Mean   :15978  Mean   : 5.59  Mean   :46.85              Mean   :66.02  Mean   :177.6  Mean   :37.9  Mean   :43.04
3rd Qu.:20336  3rd Qu.: 5.60  3rd Qu.:60.00              3rd Qu.:69.00  3rd Qu.:200.0  3rd Qu.:41.0  3rd Qu.:46.00
Max.   :41756  Max.   :16.11  Max.   :92.00              Max.   :76.00  Max.   :325.0  Max.   :56.0  Max.   :64.00
               NA's   :13                                NA's   :5      NA's   :1      NA's   :2     NA's   :2

if we then drop every row with an NA
====================================

id             glyhb           age            gender      height         weight         waist          hip
Min.   : 1000  Min.   : 2.680  Min.   :19.00  female:222  Min.   :52.00  Min.   : 99.0  Min.   :26.00  Min.   :30.00
1st Qu.: 4792  1st Qu.: 4.390  1st Qu.:34.00  male  :160  1st Qu.:63.00  1st Qu.:151.0  1st Qu.:33.00  1st Qu.:39.00
Median :15770  Median : 4.840  Median :45.00              Median :66.00  Median :173.5  Median :37.00  Median :42.00
Mean   :15979  Mean   : 5.577  Mean   :46.85              Mean   :65.99  Mean   :177.7  Mean   :37.92  Mean   :43.04
3rd Qu.:20331  3rd Qu.: 5.600  3rd Qu.:60.00              3rd Qu.:69.00  3rd Qu.:200.0  3rd Qu.:41.00  3rd Qu.:46.00
Max.   :41752  Max.   :16.110  Max.   :92.00              Max.   :76.00  Max.   :325.0  Max.   :56.00  Max.   :64.00

in the end, we are working with data of size 382 8

the first rows of diabfr
========================

    id glyhb age gender height weight waist hip taille     poids ceinture hanches      rach  sexe diabete
1 1000  4.31  46 female     62    121    29  38 157.48  54.88468    73.66   96.52 0.7631579 femme       0
2 1001  4.44  29 female     64    218    46  48 162.56  98.88314   116.84  121.92 0.9583333 femme       0
3 1002  4.64  58 female     61    256    49  57 154.94 116.11965   124.46  144.78 0.8596491 femme       0
4 1003  4.63  67   male     67    119    33  38 170.18  53.97749    83.82   96.52 0.8684211 homme       0
5 1005  7.72  64   male     68    183    44  41 172.72  83.00740   111.76  104.14 1.0731707 homme       1
6 1008  4.81  34   male     71    190    36  42 180.34  86.18255    91.44  106.68 0.8571429 homme       0
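The solution's comment about transform() hints at a variant that avoids attach() and detach(). A minimal sketch of that idea follows. It is not the author's code: the two rows are typed in by hand from the output above, and gender is assumed to be a factor.

```r
# Two rows copied by hand from the output above (ids 1000 and 1005).
diab <- data.frame(
  id     = c(1000, 1005),
  glyhb  = c(4.31, 7.72),
  gender = factor(c("female", "male")),
  height = c(62, 68),    # inches
  weight = c(121, 183),  # pounds
  waist  = c(29, 44),    # inches
  hip    = c(38, 41)     # inches
)

# transform() evaluates every expression against the ORIGINAL columns,
# so rach is computed from waist/hip (same value as ceinture/hanches,
# since the 2.54 factor cancels out).
diabfr <- transform(diab,
  taille   = height * 2.54,
  poids    = weight * 0.45359237,
  ceinture = waist  * 2.54,
  hanches  = hip    * 2.54,
  rach     = waist / hip,
  sexe     = factor(gender, levels = c("female", "male"),
                    labels = c("femme", "homme")),
  diabete  = ifelse(glyhb <= 7, 0, 1)
)
print(diabfr)
```

Naming the levels explicitly in factor() avoids relying on the alphabetical order of the levels, which levels(sexe) <- c("femme","homme") depends on.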