Introduction to Probability and Statistics

Assignment

In this assignment, we will use the dataset of diabetes patients taken from here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("../../../data/diabetes.tsv",sep='\t')
df.head()

The columns in this dataset are as follows:

  • Age and sex are self-explanatory

  • BMI is body mass index

  • BP is average blood pressure

  • S1 through S6 are different blood measurements

  • Y is a quantitative measure of disease progression over one year

Let’s study this dataset using methods of probability and statistics.

Task 1: Compute the mean and variance of all columns

df.describe()
# Another way
pd.DataFrame([df.mean(),df.var()],index=['Mean','Variance']).head()
# Or, more simply, for the mean (variance can be done similarly)
df.mean()
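One detail worth checking by hand is which variance pandas reports. The sketch below uses a tiny made-up table (`toy` is a hypothetical stand-in, not the diabetes data) to show that `var()` defaults to the sample variance (`ddof=1`) and that `agg` can return both statistics at once.

```python
import pandas as pd

# Tiny made-up table (not the diabetes data), small enough to verify by hand
toy = pd.DataFrame({"BMI": [20.0, 22.0, 24.0], "BP": [80.0, 90.0, 100.0]})

# One call that returns both statistics, one row per statistic
summary = toy.agg(["mean", "var"])
print(summary)

# pandas divides by (n - 1) by default (sample variance);
# pass ddof=0 to divide by n instead (population variance)
sample_var = toy["BMI"].var()            # (4 + 0 + 4) / (3 - 1)
population_var = toy["BMI"].var(ddof=0)  # (4 + 0 + 4) / 3
print(sample_var, population_var)
```

On the real data the same call would be `df.agg(["mean", "var"])`.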

Task 2: Plot boxplots for BMI, BP and Y depending on gender

for col in ['BMI','BP','Y']:
    df.boxplot(column=col,by='SEX')
plt.show()
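A boxplot draws the quartiles of each group, so the same information can also be read as numbers. This is a minimal sketch on made-up values (`toy` is hypothetical); it assumes the grouping column is called `SEX`, as in the dataset.

```python
import pandas as pd

# Made-up values for two groups, just to illustrate the computation
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 2, 2, 2],
    "BMI": [20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

# Lower quartile, median and upper quartile per group:
# these are the box edges and the middle line of a boxplot
quartiles = toy.groupby("SEX")["BMI"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Replacing `toy` with `df` and `"BMI"` with `"BP"` or `"Y"` would give the numbers behind the other two plots.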

Task 3: What is the distribution of the Age, Sex, BMI and Y variables?

for col in ['AGE','SEX','BMI','Y']:
    df[col].hist()
    plt.show()

Conclusions:

  • Age - normal

  • Sex - uniform

  • BMI, Y - hard to tell
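When a histogram is hard to judge by eye, skewness and a normality test can give a numeric hint. The sketch below runs on synthetic samples rather than the diabetes data; `describe_shape` is a hypothetical helper, and the Shapiro-Wilk test is only one of several possible checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=50, scale=10, size=2000)  # synthetic, bell-shaped
skewed = rng.exponential(scale=10, size=2000)        # synthetic, long right tail

def describe_shape(values):
    """Return (skewness, Shapiro-Wilk p-value) for a 1-D sample."""
    skewness = stats.skew(values)
    _, p_value = stats.shapiro(values)
    return skewness, p_value

sym_skew, sym_p = describe_shape(symmetric)
skw_skew, skw_p = describe_shape(skewed)

# Skewness near 0 suggests a symmetric shape; a small Shapiro-Wilk
# p-value is evidence against normality
print(f"symmetric: skew={sym_skew:.2f}, p={sym_p:.3g}")
print(f"skewed:    skew={skw_skew:.2f}, p={skw_p:.3g}")
```

With the real data, one would pass `df['BMI']` or `df['Y']` to the same helper.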

Task 4: Test the correlation between different variables and disease progression (Y)

Hint: The correlation matrix gives the most useful information about which variables are dependent.

df.corr()

Conclusion:

  • Y correlates most strongly with BMI and S5 (one of the blood serum measurements). This sounds reasonable.

fig, ax = plt.subplots(1,3,figsize=(10,5))
for i,n in enumerate(['BMI','S5','BP']):
    ax[i].scatter(df['Y'],df[n])
    ax[i].set_title(n)
plt.show()
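Rather than scanning the full correlation matrix by eye, the column for Y can be extracted and ranked. This is a minimal sketch on synthetic data, where the columns `A` and `B` are made up and `Y` is constructed to depend on `A` only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic table: Y depends on A, while B is unrelated noise
toy = pd.DataFrame({"A": rng.normal(size=n), "B": rng.normal(size=n)})
toy["Y"] = 2.0 * toy["A"] + rng.normal(size=n)

# Take the Y column of the correlation matrix and drop Y's correlation with itself
corr_with_y = toy.corr()["Y"].drop("Y")

# Rank by absolute value so strong negative correlations are not missed
ranked = corr_with_y.abs().sort_values(ascending=False)
print(ranked)
```

The same two lines applied to `df` would list the diabetes columns in order of how strongly they correlate with Y.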

Task 5: Test the hypothesis that the degree of diabetes progression is different between men and women

from scipy.stats import ttest_ind

# Welch's t-test (equal_var=False) comparing Y between the two sex groups
tval, pval = ttest_ind(df.loc[df['SEX']==1,'Y'], df.loc[df['SEX']==2,'Y'], equal_var=False)
print(f"T-value = {tval:.2f}\nP-value = {pval}")

Conclusion: a p-value close to 0 (typically below 0.05) would let us reject the null hypothesis that mean progression is the same for both sexes, which would support the hypothesis of a difference. In our case, there is no strong evidence that sex affects the progression of diabetes.
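The decision rule can be seen in isolation on synthetic groups. In this sketch the group means, spread and sizes are made-up numbers, and `means_differ` is a hypothetical helper wrapping Welch's t-test with a 0.05 threshold.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Synthetic groups: `shifted` has a clearly different mean from `baseline`
baseline = rng.normal(loc=150, scale=75, size=200)
shifted = rng.normal(loc=200, scale=75, size=200)

def means_differ(x, y, alpha=0.05):
    """Welch's t-test: return (reject_null, p_value) at significance level alpha."""
    _, p_value = ttest_ind(x, y, equal_var=False)
    return bool(p_value < alpha), float(p_value)

# Different means: the p-value is tiny, so the null hypothesis is rejected
reject_shifted, p_shifted = means_differ(baseline, shifted)

# A sample compared with itself: t = 0, so the p-value is 1 and nothing is rejected
reject_self, p_self = means_differ(baseline, baseline)

print(reject_shifted, p_shifted)
print(reject_self, p_self)
```

A p-value above the threshold does not prove the means are equal; it only means the data do not provide strong evidence of a difference.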