Review exercise for session 1
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php
Try to read the DIABETE data into a data.frame whose only columns are id glyhb age gender height weight waist hip, then convert the measurements to metric units. This is the diabete data set taken from the datasets that Professor F. Harrell provides with his book Regression Modeling Strategies. Beware of the many missing values.
Next, build a "diagnosed diabetes" variable based on the PubMed abstract pubmed/9258308 (boxed excerpt below). For this exercise you can therefore ignore the variables chol, stab.glu, hdl, ratio, location, frame, bp.1s, bp.1d and time.ppn, then add a waist/hip ratio variable named rach.
Note: if you cannot read the Excel file, you can fall back on the file diabete.dar, which has the same missing values, marked as NA.
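The warning about missing values matters because na.omit() drops every row that contains at least one NA, in any column. The sketch below illustrates this on a small made-up data frame (the values and the name toy are hypothetical, not taken from the diabetes file): selecting the useful columns before calling na.omit() keeps more rows than filtering the whole table.

```r
# Toy data frame with hypothetical values (NOT from the diabetes file)
toy <- data.frame(
  id     = 1:5,
  glyhb  = c(4.3, NA, 7.7, 5.1, 4.8),
  height = c(62, 64, NA, 67, 68),
  bp.2s  = c(NA, NA, NA, 150, NA)   # a column we do not need, mostly missing
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() on the whole table: a row is dropped as soon as ANY column is NA
all_cols <- na.omit(toy)
print(nrow(all_cols))

# Select the columns of interest first, then drop the incomplete rows
kept <- na.omit(toy[, c("id", "glyhb", "height")])
print(nrow(kept))
```

On this toy table only one row survives the first approach, against three for the second, because the mostly empty bp.2s column no longer counts.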
Prevalence of coronary heart disease risk factors among rural blacks: a community-based study (diabetes Dataset).

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data consist of 19 variables on 403 subjects from 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor in diabetes and heart disease. DM II is also associated with hypertension - they may both be part of "Syndrome X". The 403 subjects were the ones who were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes.

Background. Coronary heart disease (CHD) remains the most common cause of death among blacks, and the difference in CHD mortality between blacks and whites is growing. This trend may be due in part to higher rates of CHD risk factors among blacks. This study was done to determine the prevalence of CHD risk factors among a population-based sample of 403 rural blacks in Virginia.

Methods. Community-based screening evaluations included the determination of exercise and smoking habits, blood pressure, height, weight, total and high-density lipoprotein (HDL) cholesterol, and glycosylated hemoglobin.

(C) 1997 Southern Medical Association

You may need the following functions:
read.xls() from package gdata
summary() from package base
<- from package base
[ from package base
get() from package base
na.omit() from package stats
attach() from package base
detach() from package base
ifelse() from package base
levels() from package base

Solution
http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/revision1.php?sol=1
# load the (gH) helper functions

source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r", encoding="latin1" )

# read the Excel data

library(gdata)
diabete <- read.xls("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.xls")

# on the university machines gdata is not installed, so you must first run:
#    install.packages("gdata",lib="d:/")
# before you can execute
#    library(gdata)

#############################################################################################
#
# if gdata does not work, XLConnect can be used (but with a local file)
#
#    library(XLConnect)
#    wb      <- loadWorkbook("diabete.xls")
#    diabete <- readWorksheet(wb,"diabetes")
#
# as a last resort, the data in .DAR format can be used:
#
#    diabete <- lit.dar("http://forge.info.univ-angers.fr/~gh/wstat/Introduction_R/diabete.dar")
#
#############################################################################################

# let's look at a summary of the data

cats("résumé des données")
print(summary(diabete))

# na.omit() applied to all the data
# is a bit too drastic

cats("si on supprime toute ligne avec NA")
d1 <- na.omit(diabete)
print(summary(d1))

# keep only the columns we are interested in

cats("avec juste les colonnes à garder")
d2 <- diabete[ , c("id","glyhb","age","gender","height","weight","waist","hip") ]
print(summary(d2))

# and now remove the missing data

cats("si on supprime ensuite toute ligne avec NA")
d3 <- na.omit(d2)
print(summary(d3))

# in passing, let's learn the various ways of assigning

assign("diab",d3)
"<-"(diab,d3)
diab <- get("d3")
diab <- d3

cat("\n\nau final, on travaille sur des données de taille ",dim(diab),"\n")

# build the French-named variables
# convert height (inches) and weight (pounds) to taille (cm) and poids (kg)
# and convert waist, hip (inches) to ceinture and hanches (cm)

attach(diab)

taille   <- height * 2.54
poids    <- weight * 0.45359237
ceinture <- waist  * 2.54
hanches  <- hip    * 2.54

#
# to create the columns inside the data frame instead:
#
#    diab <- transform(diab,taille=height * 2.54, poids=weight * 0.45359237, etc...
#

# compute the waist/hip ratio, named rach

rach <- ceinture/hanches

# diabetes diagnosis from the variable glyhb
# coding: 0=without diabetes, 1=with diabetes

diabete <- ifelse(glyhb<=7,0,1)

# translate the variable gender into sexe

sexe <- gender
levels(sexe) <- c("femme","homme")

detach(diab)

# add these variables to the data frame

diabfr <- cbind(diab, taille, poids, ceinture, hanches, rach, sexe, diabete )

# here is the beginning of the data

cats("les premières lignes de diabfr")
print(head(diabfr))


résumé des données
==================

Pnum           id             chol           stab.glu       hdl             ratio           glyhb          location        age            gender      height
Min.   :  1.0  Min.   : 1000  Min.   : 78.0  Min.   : 48.0  Min.   : 12.00  Min.   : 1.500  Min.   : 2.68  Buckingham:200  Min.   :19.00  female:234  Min.   :52.00
1st Qu.:101.5  1st Qu.: 4792  1st Qu.:179.0  1st Qu.: 81.0  1st Qu.: 38.00  1st Qu.: 3.200  1st Qu.: 4.38  Louisa    :203  1st Qu.:34.00  male  :169  1st Qu.:63.00
Median :202.0  Median :15766  Median :204.0  Median : 89.0  Median : 46.00  Median : 4.200  Median : 4.84                  Median :45.00              Median :66.00
Mean   :202.0  Mean   :15978  Mean   :207.8  Mean   :106.7  Mean   : 50.45  Mean   : 4.522  Mean   : 5.59                  Mean   :46.85              Mean   :66.02
3rd Qu.:302.5  3rd Qu.:20336  3rd Qu.:230.0  3rd Qu.:106.0  3rd Qu.: 59.00  3rd Qu.: 5.400  3rd Qu.: 5.60                  3rd Qu.:60.00              3rd Qu.:69.00
Max.   :403.0  Max.   :41756  Max.   :443.0  Max.   :385.0  Max.   :120.00  Max.   :19.300  Max.   :16.11                  Max.   :92.00              Max.   :76.00
                              NA's   :1                     NA's   :1       NA's   :1       NA's   :13                                                NA's   :5

weight         frame       bp.1s          bp.1d           bp.2s          bp.2d           waist         hip            time.ppn
Min.   : 99.0        : 12  Min.   : 90.0  Min.   : 48.00  Min.   :110.0  Min.   : 60.00  Min.   :26.0  Min.   :30.00  Min.   :   5.0
1st Qu.:151.0  large :103  1st Qu.:121.2  1st Qu.: 75.00  1st Qu.:138.0  1st Qu.: 84.00  1st Qu.:33.0  1st Qu.:39.00  1st Qu.:  90.0
Median :172.5  medium:184  Median :136.0  Median : 82.00  Median :149.0  Median : 92.00  Median :37.0  Median :42.00  Median : 240.0
Mean   :177.6  small :104  Mean   :136.9  Mean   : 83.32  Mean   :152.4  Mean   : 92.52  Mean   :37.9  Mean   :43.04  Mean   : 341.2
3rd Qu.:200.0              3rd Qu.:146.8  3rd Qu.: 90.00  3rd Qu.:161.0  3rd Qu.:100.00  3rd Qu.:41.0  3rd Qu.:46.00  3rd Qu.: 517.5
Max.   :325.0              Max.   :250.0  Max.   :124.00  Max.   :238.0  Max.   :124.00  Max.   :56.0  Max.   :64.00  Max.   :1560.0
NA's   :1                  NA's   :5      NA's   :5       NA's   :262    NA's   :262     NA's   :2     NA's   :2      NA's   :3

si on supprime toute ligne avec NA
==================================

Pnum           id             chol           stab.glu       hdl             ratio           glyhb           location       age           gender     height
Min.   :  3.0  Min.   : 1002  Min.   :134.0  Min.   : 48.0  Min.   : 23.00  Min.   : 2.200  Min.   : 2.850  Buckingham:50  Min.   :20.0  female:76  Min.   :52.00
1st Qu.:107.0  1st Qu.: 4801  1st Qu.:184.8  1st Qu.: 85.0  1st Qu.: 37.00  1st Qu.: 3.300  1st Qu.: 4.397  Louisa    :86  1st Qu.:40.0  male  :60  1st Qu.:63.00
Median :208.0  Median :15790  Median :212.0  Median : 94.0  Median : 46.00  Median : 4.500  Median : 4.960                 Median :51.0             Median :66.00
Mean   :214.6  Mean   :18260  Mean   :218.1  Mean   :115.3  Mean   : 50.62  Mean   : 4.764  Mean   : 5.942                 Mean   :51.3             Mean   :65.59
3rd Qu.:339.5  3rd Qu.:21327  3rd Qu.:242.2  3rd Qu.:113.5  3rd Qu.: 59.25  3rd Qu.: 5.600  3rd Qu.: 6.482                 3rd Qu.:63.0             3rd Qu.:69.00
Max.   :400.0  Max.   :41507  Max.   :443.0  Max.   :369.0  Max.   :120.00  Max.   :19.300  Max.   :16.110                 Max.   :89.0             Max.   :74.00

weight         frame      bp.1s          bp.1d           bp.2s          bp.2d           waist          hip            time.ppn
Min.   :102.0        : 6  Min.   :100.0  Min.   : 60.00  Min.   :110.0  Min.   : 60.00  Min.   :28.00  Min.   :33.00  Min.   :  10.0
1st Qu.:156.0  large :36  1st Qu.:141.5  1st Qu.: 88.75  1st Qu.:138.8  1st Qu.: 84.00  1st Qu.:35.00  1st Qu.:40.00  1st Qu.:  87.5
Median :179.0  medium:72  Median :150.0  Median : 94.00  Median :150.0  Median : 92.00  Median :38.00  Median :43.00  Median : 240.0
Mean   :184.2  small :22  Mean   :155.6  Mean   : 94.13  Mean   :152.9  Mean   : 92.52  Mean   :39.12  Mean   :44.14  Mean   : 306.5
3rd Qu.:210.0             3rd Qu.:166.2  3rd Qu.:100.00  3rd Qu.:161.2  3rd Qu.:100.00  3rd Qu.:43.00  3rd Qu.:47.00  3rd Qu.: 397.5
Max.   :290.0             Max.   :230.0  Max.   :124.00  Max.   :238.0  Max.   :124.00  Max.   :55.00  Max.   :64.00  Max.   :1440.0

avec juste les colonnes à garder
================================

id             glyhb          age            gender      height         weight         waist         hip
Min.   : 1000  Min.   : 2.68  Min.   :19.00  female:234  Min.   :52.00  Min.   : 99.0  Min.   :26.0  Min.   :30.00
1st Qu.: 4792  1st Qu.: 4.38  1st Qu.:34.00  male  :169  1st Qu.:63.00  1st Qu.:151.0  1st Qu.:33.0  1st Qu.:39.00
Median :15766  Median : 4.84  Median :45.00              Median :66.00  Median :172.5  Median :37.0  Median :42.00
Mean   :15978  Mean   : 5.59  Mean   :46.85              Mean   :66.02  Mean   :177.6  Mean   :37.9  Mean   :43.04
3rd Qu.:20336  3rd Qu.: 5.60  3rd Qu.:60.00              3rd Qu.:69.00  3rd Qu.:200.0  3rd Qu.:41.0  3rd Qu.:46.00
Max.   :41756  Max.   :16.11  Max.   :92.00              Max.   :76.00  Max.   :325.0  Max.   :56.0  Max.   :64.00
               NA's   :13                                NA's   :5      NA's   :1      NA's   :2     NA's   :2

si on supprime ensuite toute ligne avec NA
==========================================

id             glyhb           age            gender      height         weight         waist          hip
Min.   : 1000  Min.   : 2.680  Min.   :19.00  female:222  Min.   :52.00  Min.   : 99.0  Min.   :26.00  Min.   :30.00
1st Qu.: 4792  1st Qu.: 4.390  1st Qu.:34.00  male  :160  1st Qu.:63.00  1st Qu.:151.0  1st Qu.:33.00  1st Qu.:39.00
Median :15770  Median : 4.840  Median :45.00              Median :66.00  Median :173.5  Median :37.00  Median :42.00
Mean   :15979  Mean   : 5.577  Mean   :46.85              Mean   :65.99  Mean   :177.7  Mean   :37.92  Mean   :43.04
3rd Qu.:20331  3rd Qu.: 5.600  3rd Qu.:60.00              3rd Qu.:69.00  3rd Qu.:200.0  3rd Qu.:41.00  3rd Qu.:46.00
Max.   :41752  Max.   :16.110  Max.   :92.00              Max.   :76.00  Max.   :325.0  Max.   :56.00  Max.   :64.00


au final, on travaille sur des données de taille  382 8

les premières lignes de diabfr
==============================

    id glyhb age gender height weight waist hip taille     poids ceinture hanches      rach  sexe diabete
1 1000  4.31  46 female     62    121    29  38 157.48  54.88468    73.66   96.52 0.7631579 femme       0
2 1001  4.44  29 female     64    218    46  48 162.56  98.88314   116.84  121.92 0.9583333 femme       0
3 1002  4.64  58 female     61    256    49  57 154.94 116.11965   124.46  144.78 0.8596491 femme       0
4 1003  4.63  67   male     67    119    33  38 170.18  53.97749    83.82   96.52 0.8684211 homme       0
5 1005  7.72  64   male     68    183    44  41 172.72  83.00740   111.76  104.14 1.0731707 homme       1
6 1008  4.81  34   male     71    190    36  42 180.34  86.18255    91.44  106.68 0.8571429 homme       0
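The solution's comments mention transform() as a way to create the new columns directly in the data frame. The sketch below is one possible version of that route, not the author's script: it retypes the first six rows printed above so that it runs without the Excel file, and it avoids attach()/detach(). Two choices here are assumptions of this sketch: rach is computed as waist / hip, since the inch-to-centimetre factor cancels in the ratio, and sexe is built with factor(..., labels = ...) rather than by reassigning levels().

```r
# First six rows of the kept columns, retyped from the printed head(diabfr)
diab <- data.frame(
  id     = c(1000, 1001, 1002, 1003, 1005, 1008),
  glyhb  = c(4.31, 4.44, 4.64, 4.63, 7.72, 4.81),
  age    = c(46, 29, 58, 67, 64, 34),
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  height = c(62, 64, 61, 67, 68, 71),
  weight = c(121, 218, 256, 119, 183, 190),
  waist  = c(29, 46, 49, 33, 44, 36),
  hip    = c(38, 48, 57, 38, 41, 42)
)

# transform() evaluates each expression with the ORIGINAL columns only,
# so rach is written with waist and hip rather than ceinture and hanches
diabfr <- transform(
  diab,
  taille   = height * 2.54,         # inches -> cm
  poids    = weight * 0.45359237,   # pounds -> kg
  ceinture = waist  * 2.54,         # inches -> cm
  hanches  = hip    * 2.54,         # inches -> cm
  rach     = waist / hip,           # waist/hip ratio (unit-free)
  sexe     = factor(gender, levels = c("female", "male"),
                    labels = c("femme", "homme")),
  diabete  = ifelse(glyhb <= 7, 0, 1)   # 0 = without diabetes, 1 = with
)

print(diabfr)
```

Because every new column lives in diabfr from the start, no free-standing vectors such as taille or diabete are left in the workspace, and the name diabete is not reused for both the raw table and the 0/1 indicator.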