Data Insight

Predicting Sleep Disorders from Health Indicators

This project examines sleep disorders using data from the Sleep Health and Lifestyle dataset to better understand which health factors are associated with sleep disorders. The dataset contains 374 observations and includes variables such as gender, sleep duration, and physical activity level. The response variable classifies individuals into three categories: no sleep disorder, insomnia, and sleep apnea. The primary goal of this study is to evaluate how well a multinomial logistic regression model can classify sleep disorder type based on the available predictors. In addition, the analysis investigates which predictors contribute most strongly to the model, how the model performs on unseen data, and whether certain variables predict one sleep disorder more than the other. Exploratory data analysis, model fitting, and validation using a train/test split were conducted to identify key patterns and evaluate predictive accuracy. The results suggest that a multinomial logistic regression model can accurately predict a patient's sleep disorder from these predictors, reaching about 92% accuracy on the test set. The most important variables in the model were BMI, sleep duration, sleep quality, stress level, and heart rate. These findings contribute to understanding how behavioral and physiological factors influence sleep health.

Background/Significance

As a college student, I am well aware of the importance of sleep. Sleep is a fundamental component of human health. Rising stress levels, demanding work schedules, and a decrease in overall physical health have contributed to widespread sleep problems in society. Understanding the factors that influence sleep quality is essential for identifying characteristics that may put people at risk for sleep disorders.

This dataset provides an opportunity to explore the relationships between physical/mental health and sleep. We will investigate how different aspects of daily life relate to sleep behaviors. By analyzing the data, we can build a regression model that predicts which kind of sleep disorder, if any, a person has based on health indicators.

Research Questions

Main Question

How well can a multinomial logistic regression model classify whether a patient has no sleep disorder, insomnia, or sleep apnea based on demographic, lifestyle, and physiological variables?

Within that question we will also look at

Which variables contribute most strongly to predicting sleep disorder type in the model?
How well does the model perform at predicting sleep disorder type on unseen data?
Are certain variables more indicative of one sleep disorder as compared to another? (Ex: High stress is a bigger flag for insomnia than sleep apnea)

Get to know the data

I conducted an exploratory data analysis of a Sleep Disorder Diagnosis dataset. The dataset contains 374 observations and 14 variables, collecting information about sleep, lifestyle, and health.

EDA

Methods

The regression model I chose for this project was a multinomial logistic regression model. I wanted to see whether a model could predict a person's sleep disorder from the variables in the dataset, and multinomial logistic regression is well suited to that goal because the response has three unordered categories. The model predicts the probability of each class, and the class with the highest predicted probability is selected as the model’s final classification. In this case, the model predicts whether a person has no sleep disorder, insomnia, or sleep apnea. In R, I used the multinom function from the nnet package to fit the model. I used the caret package to split the data into training and testing samples. For this project, I decided to use a 70% training and 30% testing split. I also experimented with an 80% training split, but I wanted a larger testing group so that the test metrics rest on more observations. To assess model fit, I used a likelihood ratio test against the null model and pseudo R-squared values. I also tested performance on new data: using the 30% testing dataset, I calculated several metrics (accuracy, kappa, and per-class sensitivity, specificity, and F1) that factor into selecting the best model. I used the tidyverse (including ggplot2) to create the graphs and the DT package to create the data tables.

Cleaning Data

I got the dataset from Kaggle. It was simple to clean. There were no NA values in the dataset, so I only had to convert gender, occupation, bmi_category, and sleep_disorder to factors. I also split blood_pressure into two separate numeric variables, bp_systolic and bp_diastolic (ex: 120/80 split into 120 and 80). The bmi_category variable had both a Normal category and a Normal Weight category, so I merged them into a single Normal category. I also simplified all the column names using the janitor package.

Diagnosed Sleep Disorders

Distribution of the types of diagnoses in the dataset. None is the most common with 219 occurrences. Sleep Apnea has 78 counts and Insomnia has 77.

Sleep Duration

This histogram shows the distribution of the number of hours slept per day by the patients. Sleep duration is roughly symmetric and multimodal. The distribution centers around 7.25 hours of sleep, and the data range from 5.75 to 8.5 hours. There appear to be clusters in the data: four distinct groups are visible in the histogram.

Sleep Duration by Sleep Disorder

As expected, people without a sleep disorder have a higher median sleep duration than those with insomnia or sleep apnea. People with sleep apnea have a higher median than people with insomnia. The insomnia group has outliers at both the high and low ends, and the none group has one low outlier. These boxplots suggest a relationship between sleep duration and sleep disorder.

Stress Level

Stress level is a subjective rating from 1 (low) to 10 (high), and this graph shows its distribution. The most common response is 3, with 71 occurrences. It is also noteworthy that no one reported extremely low stress (1 or 2) or extremely high stress (9 or 10).

Stress Level by Sleep Disorder

People without a sleep disorder have a lower median stress level than those with insomnia or sleep apnea. Insomnia and sleep apnea both have a median of 7. This boxplot points to a relationship between stress level and sleep disorder.

Physical Activity by Sleep Disorder

Interestingly, people with sleep apnea have a higher median physical activity level than those without a sleep disorder. People with insomnia have the lowest median. There are high outliers in the insomnia group and low outliers in the insomnia and sleep apnea groups. This boxplot suggests a relationship between physical activity and sleep disorder.

Heart Rate by Sleep Disorder

This boxplot shows the distribution of heart rate by sleep disorder category. People without a sleep disorder have a lower median heart rate than those with sleep apnea or insomnia. Patients with insomnia have the highest median heart rate. All three categories have some high outliers.

Sleep Disorder by BMI

This stacked bar chart is very telling. It shows that a large majority of people without a sleep disorder have a normal BMI, while the majority in both the sleep apnea and insomnia groups are overweight. There are also no obese people in the data who do not have a sleep disorder. There is a clear relationship between BMI and sleep disorder.

Occupation by Sleep Disorder

This stacked barchart shows occupations stacked with sleep disorders. From these bars it is obvious that each job has unique proportions. It is hard to determine from this graph if the variable will be useful in the model.

Gender

This is a pie chart of the genders of the patients. There is an approximately even number of males and females in this study.

Pairwise Plot

From this plot we can see there is a linear relationship between sleep duration and physical activity, heart rate and daily steps, and heart rate and sleep duration. These relationships could indicate multicollinearity.

Multinomial Logistic Model

First Model

Fit First Model

First, we split the data into a training and testing (70/30) sample. We are using a seed number so that the random split will be the same every time. The table shows the distribution of sleep disorders in both training and testing.

Then, we fit a multinomial logistic regression predicting sleep_disorder from every other variable in the data except person_id using the training data.
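
A minimal sketch of these two steps, under stated assumptions: the data frame `toy` is simulated and only stands in for the cleaned sleep data, the seed is hypothetical, and the stratified split is written in base R as a stand-in for caret's createDataPartition so that the block depends only on nnet.

```r
library(nnet)  # provides multinom()

set.seed(123)  # hypothetical seed; the report's actual seed is not shown

# Simulated stand-in for the cleaned data (NOT the real sleep dataset)
n <- 300
toy <- data.frame(
  stress_level   = sample(3:8, n, replace = TRUE),
  sleep_duration = round(runif(n, 5.8, 8.5), 1)
)
# Three-class outcome loosely tied to the predictors, with None as first level
score <- toy$stress_level - 2 * (toy$sleep_duration - 7) + rnorm(n)
toy$sleep_disorder <- cut(score, breaks = c(-Inf, 5, 7, Inf),
                          labels = c("None", "Insomnia", "Sleep Apnea"))

# Stratified 70/30 split: draw 70% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_len(n), toy$sleep_disorder),
  function(idx) idx[sample.int(length(idx), round(0.7 * length(idx)))]
))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Multinomial logistic regression on all remaining columns
fit <- multinom(sleep_disorder ~ ., data = train, trace = FALSE)

# Predicted class = class with the highest predicted probability
pred <- predict(fit, newdata = test)
table(Prediction = pred, Reference = test$sleep_disorder)
```

With the real data, the same pattern would be applied to the cleaned data frame with person_id dropped, and the class proportions in `train` and `test` can be compared with `prop.table(table(...))`.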

Goodness of Fit

The full model significantly improved fit over the null model: the p-value for the likelihood ratio test is practically 0 (G2 = 371 on 44 degrees of freedom). The McFadden R-squared value is extremely high at 0.730. McFadden values of 0.2 to 0.4 are commonly described as an excellent fit, so this is a very strong result. It indicates that the predictors account for a large share of the variation in sleep disorder classification.

# A tibble: 1 × 4
  Test                               G2    df  p_value
  <chr>                           <dbl> <dbl>    <dbl>
1 Likelihood Ratio (Null vs Full)  371.    44 3.11e-53

fitting null model for pseudo-r2

 McFadden 
0.7296034

Interpret Coefficients

Log-Odds

A positive coefficient means that the predictor increases the log-odds of being in that category versus the reference category (None). For example, an increase in age leads to an increase in the log-odds of a person having sleep apnea, and genderFemale has a positive coefficient for insomnia (1.27), so being female increases the log-odds of insomnia. A negative coefficient is the opposite: genderFemale has a negative coefficient for sleep apnea, which means that being female decreases the log-odds of having sleep apnea.

# A tibble: 46 × 6
   y.level  term                            estimate std.error statistic p.value
   <chr>    <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 Insomnia (Intercept)                    -609.      2.65e- 3  -2.30e 5   0    
 2 Insomnia genderFemale                      1.27    1.99e- 2   6.39e 1   0    
 3 Insomnia age                              -0.0955  6.98e- 2  -1.37e 0   0.171
 4 Insomnia occupationDoctor               -141.      6.70e- 3  -2.10e 4   0    
 5 Insomnia occupationEngineer              -70.0     2.62e- 2  -2.67e 3   0    
 6 Insomnia occupationLawyer               -122.      2.18e- 2  -5.60e 3   0    
 7 Insomnia occupationManager               -81.8     1.36e-16  -6.00e17   0    
 8 Insomnia occupationNurse                -192.      1.79e- 2  -1.07e 4   0    
 9 Insomnia occupationSales Representative -334.      1.10e-16  -3.04e18   0    
10 Insomnia occupationSalesperson          -135.      2.76e- 2  -4.88e 3   0    
# ℹ 36 more rows

Performance

Confusion Matrix

The model classifies all three sleep disorder categories with high accuracy. Of the 64 individuals predicted to have no sleep disorder, 61 are correct; the other 3 actually have insomnia, and none have sleep apnea. Of the 20 predicted insomnia cases, 19 are correct and 1 actually has sleep apnea. Of the 27 predicted sleep apnea cases, 22 are correct and 5 are misclassified. Overall, the model shows strong performance, with 102 of the 111 test cases classified correctly.

Model Metrics

The model shows a high accuracy of 0.9189 and a strong Kappa value of 0.8589, indicating excellent agreement between predicted and actual sleep disorder classifications beyond chance. The 95% confidence interval for accuracy, 0.8517–0.9623, lies well above the no-information rate of 0.5856 (the accuracy obtained by always predicting None). The McNemar test p-value of 0.0719 shows no strong evidence that misclassifications are asymmetric between classes.

The model performs strongly across all three classes, with each showing high sensitivity, specificity, and balanced accuracy. The None class is predicted most accurately, with F1 = 0.95 and balanced accuracy = 0.94, indicating excellent identification of individuals without a sleep disorder. Insomnia shows lower sensitivity of 0.83 but very high precision, 0.95, meaning the model rarely mislabels other classes as insomnia. Sleep Apnea has the highest sensitivity, 0.96, and balanced accuracy, 0.95, but the lowest precision, 0.81, so its F1 is 0.88: the model misses few true sleep apnea cases but assigns some other cases to that class. Overall, the model distinguishes all three classes with high consistency and few misclassifications.

	x
Accuracy	0.9189
Kappa	0.8589
AccuracyLower	0.8517
AccuracyUpper	0.9623
AccuracyNull	0.5856
AccuracyPValue	0.0000
McnemarPValue	0.0719

	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy
Class: None	0.9385	0.9348	0.9531	0.9149	0.9531	0.9385	0.9457	0.5856	0.5495	0.5766	0.9366
Class: Insomnia	0.8261	0.9886	0.9500	0.9560	0.9500	0.8261	0.8837	0.2072	0.1712	0.1802	0.9074
Class: Sleep Apnea	0.9565	0.9432	0.8148	0.9881	0.8148	0.9565	0.8800	0.2072	0.1982	0.2432	0.9499

Second Model

Select Variables

I wanted to select variables that were significant for both insomnia and sleep apnea in the previous model. I also wanted a simpler model to see how it compares to the more complex one. In the second model, we fit sleep_disorder on sleep_duration, quality_of_sleep, stress_level, bmi_category, heart_rate, and daily_steps. The table shows the p-values for each predictor variable in the first model.

# weights:  27 (16 variable)
initial  value 288.935032 
iter  10 value 186.629419
iter  20 value 116.362801
iter  30 value 115.755416
iter  40 value 115.669901
iter  50 value 115.667759
iter  60 value 115.667649
final  value 115.667647 
converged

Goodness of Fit

The McFadden value for the new model is 0.545. This is still a very strong McFadden value. It indicates that the predictors for the model explain a significant amount of the variation in sleep disorders.

From the P-value table, most predictors have significant p-values for both insomnia and sleep apnea. The only non-significant predictor was sleep_duration for insomnia. This could be due to the large spread of sleep_duration in the insomnia class, making it not very useful as a predictor.

fitting null model for pseudo-r2
# weights:  6 (2 variable)
initial  value 288.935032 
final  value 253.977346 
converged

 McFadden 
0.5445749
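
The reported McFadden value can be checked by hand from the optimizer output above. The sketch below assumes that the `final value` lines are negative log-likelihoods (which is how nnet reports a multinomial fit) and that the parameter counts are the `16 variable` and `2 variable` figures from the same output; the variable names are illustrative only.

```r
# Negative log-likelihoods copied from the nnet output in this report
nll_null <- 253.977346   # intercept-only model (2 parameters)
nll_full <- 115.667647   # second model (16 parameters)

# McFadden pseudo R-squared: 1 - logLik(full) / logLik(null)
mcfadden <- 1 - nll_full / nll_null

# Likelihood ratio statistic and its chi-squared p-value
g2      <- 2 * (nll_null - nll_full)
df      <- 16 - 2
p_value <- pchisq(g2, df = df, lower.tail = FALSE)

round(c(McFadden = mcfadden, G2 = g2, df = df), 4)
p_value
```

The same arithmetic applies to the first model once its final negative log-likelihood is read from its own fitting output.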

Interpret Coefficients

Odds-Ratios

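The odds ratios tell us how a one-unit change in a predictor affects the odds of being in a certain group rather than the reference group (None), holding the other predictors fixed. In this model, bmi_category, specifically the Obese category, had a large impact on prediction. Being obese raises the odds of both insomnia (odds ratio 158,565) and sleep apnea (odds ratio 75,369) compared to someone with a normal BMI. These values are so extreme because no obese person in the data is free of a sleep disorder, so they are best read as very large rather than as precise estimates. Each one-unit increase in heart_rate lowered the odds of insomnia (0.775) and raised the odds of sleep apnea (1.339).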

	(Intercept)	sleep_duration	quality_of_sleep	stress_level	bmi_categoryOverweight	bmi_categoryObese	heart_rate	daily_steps
Insomnia	141260818	1.009233	0.6215461	1.4710520	64.09216	158564.59	0.7750977	0.9995273
Sleep Apnea	0	5.470877	0.3202015	0.5042633	90.74462	75369.28	1.3394414	1.0007605
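
Because odds ratios are stated per one-unit change, a predictor measured in small units can look unimportant. The sketch below rescales two odds ratios from the table above to larger steps; the step sizes (1,000 steps, 5 units of heart rate) are illustrative choices, not values from the analysis.

```r
# Odds ratios for insomnia vs. None, copied from the table above
or_daily_steps <- 0.9995273   # per single step
or_heart_rate  <- 0.7750977   # per one unit of heart_rate

# The odds ratio for a k-unit change is the one-unit odds ratio raised to k
or_steps_per_1000 <- or_daily_steps^1000
or_hr_per_5       <- or_heart_rate^5

# Percent change in odds for a one-unit change: (OR - 1) * 100
pct_change_hr <- (or_heart_rate - 1) * 100

round(c(steps_per_1000 = or_steps_per_1000,
        hr_per_5       = or_hr_per_5,
        hr_pct_change  = pct_change_hr), 3)
```

Read this way, the daily_steps odds ratio that looks like 1 per step corresponds to an odds ratio of roughly 0.62 per 1,000 additional steps.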

Performance

Confusion Matrix

The second model shows strong overall performance in predicting sleep disorder categories. Of the 65 individuals it predicts to have no sleep disorder, 61 are correct; the other 4 actually have insomnia (3) or sleep apnea (1). Of the 21 predicted insomnia cases, 19 are correct and 2 actually have no sleep disorder. Of the 25 predicted sleep apnea cases, 22 are correct; the other 3 actually have no sleep disorder (2) or insomnia (1). These results indicate that the model performs well across all classes.

Model Metrics

The model achieves high accuracy (0.9189) with a strong Kappa score (0.8581), showing that its predictions align very well with the true sleep disorder categories beyond what would occur by chance. The accuracy interval 0.8517–0.9623 indicates stable performance. The McNemar P-value of 0.6746 suggests that the model is not biased to any particular class when making mistakes.

The model performs well across all three categories. For individuals without a sleep disorder, sensitivity and precision are both high, 0.94, meaning the model correctly identifies most of these cases and rarely mislabels others as having no disorder. Its balanced accuracy of 0.93 reflects strong overall performance. For insomnia, sensitivity is lower, 0.83, but precision is high, 0.90, and the balanced accuracy is 0.90. For sleep apnea, performance is strongest, with very high sensitivity of 0.96, strong precision, 0.88, and a high F1 score of 0.92. The balanced accuracy, 0.96, shows the model identifies sleep apnea very reliably. Overall, the model performs best on sleep apnea and none.

	x
Accuracy	0.9189
Kappa	0.8581
AccuracyLower	0.8517
AccuracyUpper	0.9623
AccuracyNull	0.5856
AccuracyPValue	0.0000
McnemarPValue	0.6746

	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy
Class: None	0.9385	0.9130	0.9385	0.9130	0.9385	0.9385	0.9385	0.5856	0.5495	0.5856	0.9258
Class: Insomnia	0.8261	0.9773	0.9048	0.9556	0.9048	0.8261	0.8636	0.2072	0.1712	0.1892	0.9017
Class: Sleep Apnea	0.9565	0.9659	0.8800	0.9884	0.8800	0.9565	0.9167	0.2072	0.1982	0.2252	0.9612
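
The metrics in these tables follow directly from the confusion matrix counts. The sketch below rebuilds the second model's matrix from the counts given in the Confusion Matrix description, reading each group of counts as a row of predictions (an assumption, chosen because it reproduces the reported class totals of 65, 23, and 23), and recomputes a few of the metrics in base R.

```r
# Second model: rows = predicted class, columns = true class (assumed layout)
classes <- c("None", "Insomnia", "Sleep Apnea")
cm <- matrix(c(61,  3,  1,
                2, 19,  0,
                2,  1, 22),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- diag(cm) / colSums(cm)   # correct / all true members of the class
precision   <- diag(cm) / rowSums(cm)   # correct / all predictions of the class
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa), 4)
round(rbind(sensitivity, precision, f1), 4)
```

Balanced accuracy for a class is the average of its sensitivity and specificity, so it can be added the same way once the specificity is computed from the remaining cells.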

Discussion

Results

Best Model

The two models perform nearly identically. Their accuracy, accuracy confidence interval, and accuracy p-value are exactly the same, and their kappa values are almost identical (0.8589 vs 0.8581). They have different McNemar p-values, but neither is significant. The first model performs slightly better at predicting none and insomnia (comparing balanced accuracy), and the second model performs better at predicting sleep apnea. The first model has a much higher McFadden R-squared value than the second (0.730 vs 0.545). The two models are very similar, yet one is complex with 12 predictors and the other has just 6. Both models are very accurate, so there is no wrong answer in selecting one. For the sake of this report, I will select the first model as the better of the two: it has a higher pseudo R-squared value and performs better on two of the three classes in sleep_disorder.

Insomnia vs None

Several predictors show strong associations with the odds of being classified with insomnia rather than having no sleep disorder. Being female increases the odds of insomnia by more than threefold, and higher stress levels also raise the odds (1.47), suggesting that emotional or psychological strain is an important factor. In contrast, both sleep duration and sleep quality are strongly protective: longer sleep and higher sleep quality sharply reduce the odds of insomnia, with very small odds ratios indicating that poor sleep is central to insomnia risk. BMI also plays a major role. Being obese is associated with extremely high odds of insomnia relative to a normal BMI, making it one of the most influential predictors in the model. Physiological measures point in the same direction: higher heart rate and lower physical activity both increase the likelihood of insomnia. Overall, the strongest signals for insomnia involve lifestyle quality (sleep duration, sleep quality, stress) and health status (BMI, heart rate), indicating that both behavioral and physiological factors contribute substantially to insomnia risk.

Sleep Apnea vs None

A different pattern emerges when comparing sleep apnea to none. Being female dramatically lowers the odds of sleep apnea (odds ratio near zero), suggesting a strong gender effect with males at higher risk. Each additional year of age increases the odds (1.53), consistent with known clinical patterns. Several occupations, such as Doctor, Engineer, Scientist, and Sales Representative, show extremely high odds ratios, indicating a much higher likelihood of sleep apnea relative to the baseline occupation, though such large values may partly reflect small sample sizes. Among lifestyle factors, sleep duration stands out most strongly: each additional hour of sleep increases the odds of sleep apnea more than fiftyfold (57), highlighting a link between long sleep duration and apnea. Sleep quality also sharply increases the odds (80). Higher heart rate is strongly associated with sleep apnea (32), and elevated diastolic blood pressure also increases the odds. BMI again plays a major role, with obesity showing extremely high odds for sleep apnea. The biggest predictors of sleep apnea are gender, age, BMI, sleep duration, and cardiovascular health, with clear patterns separating sleep apnea from no sleep disorder.

Assumptions

Check Assumptions

Before interpreting the multinomial logistic regression model, I checked two important assumptions. The first was independence of observations: each observation in the data is independent of every other observation. The second was the absence of severe multicollinearity, which can be indicated by high correlation between predictors or checked with VIF values. There appears to be high multicollinearity between bp_systolic and bp_diastolic, since both are derived from blood pressure (their GVIF^(1/(2*Df)) values are about 10.5 and 11.0, the highest in the table). This will not affect the model’s predictive accuracy, but I need to be careful when interpreting their coefficients and odds ratios, since their estimates may be unstable. I had to check multicollinearity using a linear model, since the vif function in the car package is not compatible with nnet models. However, VIF values from a linear model depend only on the predictors, not on the response, so they still describe the collinearity among the predictors.

                               GVIF Df GVIF^(1/(2*Df))
gender                    11.214336  1        3.348781
age                       20.098211  1        4.483103
occupation              1981.484101 10        1.461671
sleep_duration            11.249439  1        3.354018
quality_of_sleep          28.396296  1        5.328818
physical_activity_level    7.178047  1        2.679188
stress_level              24.236812  1        4.923090
bmi_category              77.900289  2        2.970878
bp_systolic              109.907974  1       10.483700
bp_diastolic             121.691867  1       11.031404
heart_rate                10.755717  1        3.279591
daily_steps                8.439698  1        2.905116
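
The last column of the table is the GVIF rescaled so that terms with different degrees of freedom are comparable. As a quick check of that relationship, the sketch below recomputes it for three rows using values copied from the table; the variable names are illustrative.

```r
# GVIF and degrees of freedom copied from three rows of the table above
gvif <- c(occupation = 1981.484101, bmi_category = 77.900289, bp_systolic = 109.907974)
df   <- c(occupation = 10,          bmi_category = 2,         bp_systolic = 1)

# Adjusted value: GVIF^(1 / (2 * Df)); for Df = 1 this is the square root of the VIF
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 4)
```

Squaring the adjusted value puts it back on the usual VIF scale, so a rule of thumb such as VIF above 10 corresponds to an adjusted value above roughly 3.16. This is why occupation, despite its very large raw GVIF, is less of a concern than the two blood pressure variables.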

Discussion and Limitations

Throughout this project I have learned so much. I gained familiarity with many R packages, including tidyverse, nnet, caret, and DT. I also learned how to properly create and interpret a multinomial logistic regression model. One of the larger limitations is the small sample size of the data; I would have liked more observations so that the testing data could be larger. Different predictors could have affected the results, although the model already performed well with the available variables. In the future, I would like to make predictions using other methods, such as a neural network or a random forest classifier.

References

Sources

Sleep Disorder Diagnosis Dataset. Kaggle, https://www.kaggle.com/datasets/mdsultanulislamovi/sleep-disorder-diagnosis-dataset/data
Chen, Tessa. Multinomial Logistic Regression (MLR): Concepts & Application.
Kuhn, Max, and Jed Wing. The caret Package. topepo.github.io/caret/index.html.
“flexdashboard Manual.” RStudio R-Universe, https://rstudio.r-universe.dev/flexdashboard/doc/manual.html

AI Usage

AI was used in this project to help create color themes, format the dashboard, and debug some code.

About Me

Who Am I?

Bio

Hello, my name is Greg Pologruto. I am an Honors Computer Science student at the University of Dayton with minors in Data Analytics and Mathematics. I am on track to graduate in May of 2027.

I have experience working for FedEx as an information technology intern. During my time there, I worked with big data (millions of packages) to create a dashboard of key performance indicators that were critical to the business and to my agile team.

I have skills in Python, R, and Power BI. I have relevant projects in data visualization, data science, and machine learning. My goal is to acquire a data science internship in the technology field this summer.

Contact Me

pologrutog1@udayton.edu | LinkedIn | GitHub

Picture of Me

Me (left) and my roommate Logan showing our Christmas spirit.

---
title: "Sleep Disorder Prediction"
author: "Greg Pologruto"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: lumen
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title {  /* chart_title  */
   font-size: 16px;
  }
body{ /* Normal  */
      font-size: 16px;
  }
</style>


```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)

# global theme and colors (This code is from chatgpt)
theme_set(theme_minimal(base_size = 14))
okabe <- c("#0072B2", "#E69F00", "#F0E442",
           "#009E73", "#56B4E9", "#D55E00", "#CC79A7")

scale_fill_discrete  <- function(...) scale_fill_manual(values = okabe)
scale_color_discrete <- function(...) scale_color_manual(values = okabe)
```

Data Insight
===

Column {data-width=500}
---

### Predicting Sleep Disorders from Health Indicators 

This project examines sleep disorders using data from the *Sleep Health and Lifestyle* dataset to better understand which health factors are associated with sleep disorders. The dataset contains 374 observations and includes variables such as gender, sleep duration, and physical activity level. The response variable classifies individuals into three categories: no sleep disorder, insomnia, and sleep apnea. The primary goal of this study is to evaluate how well a multinomial logistic regression model can classify sleep disorder type based on the available predictors. In addition, the analysis investigates which predictors contribute most strongly to the model, how the model performs on unseen data, and whether certain variables predict one sleep disorder more than the other. Exploratory data analysis, model fitting, and validation using a train/test split were conducted to identify key patterns and evaluate predictive accuracy. The results suggest that a multinomial logistic regression model can accurately predict a patient's sleep disorder from these predictors, reaching about 92% accuracy on the test set. The most important variables in the model were BMI, sleep duration, sleep quality, stress level, and heart rate. These findings contribute to understanding how behavioral and physiological factors influence sleep health.

### Background/Significance

As a college student, I am well aware of the importance of sleep. Sleep is a fundamental component of human health. Rising stress levels, demanding work schedules, and a decrease in overall physical health have contributed to widespread sleep problems in society. Understanding the factors that influence sleep quality is essential for identifying characteristics that may put people at risk for sleep disorders.

This dataset provides an opportunity to explore the relationships between physical/mental health and sleep. We will investigate how different aspects of daily life relate to sleep behaviors. By analyzing the data, we can build a regression model that predicts which kind of sleep disorder, if any, a person has based on health indicators.



Column {.tabset data-width=500}
---

### Research Questions

#### Main Question

  * How well can a multinomial logistic regression model classify whether a patient has no sleep disorder, insomnia, or sleep apnea based on demographic, lifestyle, and physiological variables?
  
Within that question we will also look at

  * Which variables contribute most strongly to predicting sleep disorder type in the model?
  * How well does the model perform at predicting sleep disorder type on unseen data?
  * Are certain variables more indicative of one sleep disorder as compared to another? (Ex: High stress is a bigger flag for insomnia than sleep apnea)

### Get to know the data

I conducted an exploratory data analysis of a Sleep Disorder Diagnosis dataset. The dataset contains 374 observations and 14 variables, collecting information about sleep, lifestyle, and health.  

```{r package_data, warning= F, fig.height=7}
pacman::p_load(knitr, tidyverse, readr, janitor,DT)

sleep <- read_csv("Sleep_health_and_lifestyle_dataset.csv")

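# standardize column names to snake_case with janitor
sleep <- clean_names(sleep)

# change variable types, merge duplicate BMI labels, and set factor level order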
sleep <- sleep |>
  mutate(
    gender = factor(gender, levels = c("Male", "Female")),
    bmi_category = case_when(
      bmi_category == "Normal Weight" ~ "Normal",
      TRUE ~ bmi_category
    ),
    bmi_category = factor(bmi_category, levels=c("Normal","Overweight","Obese")),
    sleep_disorder = factor(sleep_disorder, levels=c("None", "Insomnia", "Sleep Apnea")),
    occupation = factor(occupation)
  )|>
  # split blood_pressure into two separate numbers (120/80 -> s: 120, d: 80)
  separate(blood_pressure, into = c("bp_systolic", "bp_diastolic"),
           sep = "/", remove = TRUE, convert = TRUE)

datatable(sleep, class = 'cell-border stripe')
```


EDA
===

Column {data-width=500}
---

### Methods

The regression model I chose for this project was a multinomial logistic regression model. I wanted to see whether a model could predict a person's sleep disorder from the variables in the dataset, and multinomial logistic regression is well suited to that goal because the response has three unordered categories. The model predicts the probability of each class, and the class with the highest predicted probability is selected as the model’s final classification. In this case, the model predicts whether a person has no sleep disorder, insomnia, or sleep apnea. In R, I used the multinom function from the nnet package to fit the model. I used the caret package to split the data into training and testing samples. For this project, I decided to use a 70% training and 30% testing split. I also experimented with an 80% training split, but I wanted a larger testing group so that the test metrics rest on more observations. To assess model fit, I used a likelihood ratio test against the null model and pseudo R-squared values. I also tested performance on new data: using the 30% testing dataset, I calculated several metrics (accuracy, kappa, and per-class sensitivity, specificity, and F1) that factor into selecting the best model. I used the tidyverse (including ggplot2) to create the graphs and the DT package to create the data tables.  
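
As a compact statement of the model (standard textbook notation rather than anything specific to this analysis), take None as the reference category, let $\mathbf{x}$ be the vector of predictors, and give each non-reference class its own intercept and coefficient vector:

$$
\log\frac{P(Y = k \mid \mathbf{x})}{P(Y = \text{None} \mid \mathbf{x})} = \beta_{k0} + \mathbf{x}^\top \boldsymbol{\beta}_k,
\qquad k \in \{\text{Insomnia}, \text{Sleep Apnea}\}
$$

$$
P(Y = k \mid \mathbf{x}) = \frac{\exp(\beta_{k0} + \mathbf{x}^\top \boldsymbol{\beta}_k)}{1 + \sum_{j} \exp(\beta_{j0} + \mathbf{x}^\top \boldsymbol{\beta}_j)},
\qquad
P(Y = \text{None} \mid \mathbf{x}) = \frac{1}{1 + \sum_{j} \exp(\beta_{j0} + \mathbf{x}^\top \boldsymbol{\beta}_j)}
$$

Here the sums run over the two non-reference classes, and the predicted class is the one with the largest of the three probabilities.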

### Cleaning Data

I got the dataset from [Kaggle](https://www.kaggle.com/datasets/mdsultanulislamovi/sleep-disorder-diagnosis-dataset/data). It was simple to clean. There were no NA values in the dataset, so I only had to convert gender, occupation, bmi_category, and sleep_disorder to factors. I also split blood_pressure into two separate numeric variables, bp_systolic and bp_diastolic (ex: 120/80 split into 120 and 80). The bmi_category variable had both a Normal category and a Normal Weight category, so I merged them into a single Normal category. I also simplified all the column names using the janitor package.


Column {.tabset data-width=500}
---

### Diagnosed Sleep Disorders
 
```{r disorder_bar, fig.cap="Distribution of the types of diagnoses in the dataset. None is the most common with 219 occurrences. Sleep Apnea has 78 counts and Insomnia has 77."}
ggplot(sleep,
       aes(x=sleep_disorder))+
  geom_bar(color="black", fill="#0072B2")+
  labs(
    title="Distribution of Diagnoses",
    x = "Sleep Disorder"
  )+
  geom_text(stat="count",
            aes(label = after_stat(count)),
            vjust = -0.4)+
  theme_minimal()+
  theme(legend.position ="none")
```


### Sleep Duration
 
```{r sleep_duration_hist, fig.cap="This histogram shows the distribution of the number of hours slept per day by the patients. Sleep duration is roughly symmetric and multimodal. The distribution centers around 7.25 hours of sleep, and the data range from 5.75 to 8.5 hours. There appear to be clusters in the data: four distinct groups are visible in the histogram."}
ggplot(sleep,
       aes(x=sleep_duration))+
  geom_histogram(fill = "#0072B2", color="black")+
  labs(
    title="Distribution of Sleep Duration",
    x="Sleep Duration (Hours)"
  )+
  scale_x_continuous(breaks = seq(5, 10, by = 0.25))+
  theme_minimal()
```

### Sleep Duration by Sleep Disorder

```{r sleep_duration_by_disorder, fig.cap="As expected, people without a sleep disorder have a higher median sleep duration than those with insomnia or sleep apnea. People with sleep apnea have a higher median than people with insomnia. The insomnia group has outliers at both the high and low ends, and the none group has one low outlier. These boxplots suggest a relationship between sleep duration and sleep disorder."}
ggplot(sleep,
       aes(y=sleep_duration, x=sleep_disorder, fill = sleep_disorder))+
  geom_boxplot(color="black")+
  labs(
    title = "Sleep Duration Boxplot by Sleep Disorder",
    x = "Sleep Disorder",
    y = "Sleep Duration"
  )+
  theme_minimal()+
  theme(legend.position = "none")
```

### Stress Level
 
```{r stress_level_bar, fig.cap="Stress level is a subjective stress rating from 1 (low) to 10 (high). This graph shows the distribution of it. 3 is the most common response with 71. It is also noteworthy that no one reported extremely low stress (1 or 2) or extremely high stress (9 or 10)."}
ggplot(sleep,
       aes(x=as.factor(stress_level)))+
  geom_bar(color="black",fill="#0072B2")+
  labs(
    title="Distribution of Stress Levels",
    x= "Stress Levels (1-10)"
  )+
  geom_text(stat="count",
            aes(label = after_stat(count)),
            vjust = -0.4)+
  theme_minimal()+
  theme(legend.position = "none")
```

### Stress Level by Sleep Disorder

```{r stress_level_by_sleep_duration, fig.cap="People without a sleep disorder have a lower median stress level than people with insomnia or sleep apnea. Insomnia and sleep apnea both have a median of 7. This boxplot suggests a relationship between stress level and sleep disorder."}
ggplot(sleep,
       aes(y=stress_level, x=sleep_disorder, fill = sleep_disorder))+
  geom_boxplot(color="black")+
  labs(
    title = "Stress Level Boxplot by Sleep Disorder",
    x = "Sleep Disorder",
    y = "Stress Level"
  )+
  theme_minimal()+
  theme(legend.position = "none")
```

### Physical Activity by Sleep Disorder
```{r physical_activity_sleep_disorder, fig.cap="Interestingly, people with sleep apnea have a higher median physical activity level than those without a sleep disorder. People with insomnia have the lowest median. There are high outliers in the insomnia group and low outliers in the insomnia and sleep apnea groups. This boxplot suggests a relationship between physical activity and sleep disorder."}
ggplot(sleep,
       aes(x=sleep_disorder, y=physical_activity_level, fill=sleep_disorder))+
  geom_boxplot(color="black")+
  labs(
    title = "Physical Activity Boxplot by Sleep Disorder",
    x = "Sleep Disorder",
    y = "Physical Activity"
    )+
  theme_minimal()+
  theme(legend.position = "none")
```

### Heart Rate by Sleep Disorder

```{r heart_rate_sleep_disorder, fig.cap="Boxplot of heart rate by sleep disorder category. People without a sleep disorder have a lower median heart rate than those with sleep apnea and insomnia. Patients with insomnia have the highest median heart rate. All categories appear to have some high outliers."}
ggplot(sleep,
       aes(x=sleep_disorder, y=heart_rate, fill=sleep_disorder))+
  geom_boxplot(color="black")+
  labs(
    title = "Heart Rate Boxplot by Sleep Disorder",
    x = "Sleep Disorder",
    y = "Heart Rate"
    )+
  theme_minimal()+
  theme(legend.position = "none")
```

### Sleep Disorder by BMI

```{r bmi_sleep_disorder, fig.cap= "This stacked bar chart is very telling for our data. It shows that a large majority of people without a sleep disorder have a normal BMI. The majority in both the sleep apnea and insomnia groups is overweight. There are also no obese people in the data who do not have a sleep disorder. There is a clear relationship between BMI and sleep disorder."}
ggplot(sleep,
       aes(x = sleep_disorder, fill = bmi_category))+
  geom_bar(position = "fill", color="black") + 
  labs(
    title = "Sleep Disorder Barchart by BMI Category",
    x = "Sleep Disorder",
    y = "Proportion"
    ) +
  theme_minimal()
```

### Occupation by Sleep Disorder

```{r occupation_sleep_disorder, fig.width=8, fig.cap="This stacked barchart shows occupations stacked with sleep disorders. From these bars it is obvious that each job has unique proportions. It is hard to determine from this graph if the variable will be useful in the model."}
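# Split the occupations into two groups so the bars display properly.
# factor() makes this work whether occupation is stored as character or factor.
occ_levels <- levels(factor(sleep$occupation))

first6  <- occ_levels[1:6]
last5   <- occ_levels[7:11]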

sleep1 <- sleep |> filter(occupation %in% first6)
sleep2 <- sleep |> filter(occupation %in% last5)

p1 <- ggplot(sleep1,
       aes(x = occupation, fill = sleep_disorder))+
  geom_bar(position = "fill", color="black") + 
  labs(
    title = "Occupation Barchart by Sleep Disorder",
    x = "Occupation",
    y = "Proportion"
    ) +
  theme_minimal()

p2 <- ggplot(sleep2,
       aes(x = occupation, fill = sleep_disorder))+
  geom_bar(position = "fill", color="black") + 
  labs(
    x = "Occupation",
    y = "Proportion"
    ) +
  theme_minimal()+
  theme(legend.position = "none")

library(patchwork)

p1/p2
```

### Gender

```{r gender_pie, fig.cap="This is a pie chart of the genders of the patients. There is an approximately even amount of males and females in this study."}
gender_count <- count(sleep, gender)
gender_count$percent <- round(gender_count$n/sum(gender_count$n)*100,2)

ggplot(gender_count,
              aes(x="", y=percent, fill=gender))+
  geom_bar(stat='identity', width =1, color='black')+
  coord_polar("y",start=0)+
  geom_text(aes(label=paste0(percent,"%")),
            fontface="bold",
            color="black",
            position = position_stack(vjust=0.7))+
  labs(title="Pie Chart of Gender")+
  scale_fill_manual(values = c("Female" = "pink", "Male" = "lightblue"))+
  theme_void()
```

### Pairwise Plot

```{r pairwise_plot, fig.cap="From this plot we can see there is a linear relationship between sleep duration and physical activity, heart rate and daily steps, and heart rate and sleep duration. These relationships could indicate multicollinearity."}
pairs(sleep[,c("age","sleep_duration","physical_activity_level","heart_rate","daily_steps")],
      col = sleep$sleep_disorder)
```

Multinomial Logistic Model
===

Column {.tabset data-width=500}
---

### First Model

#### Fit First Model

First, we split the data into a training and testing (70/30) sample. We are using a seed number so that the random split will be the same every time. The table shows the distribution of sleep disorders in both training and testing. 
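
To show what a stratified 70/30 split does, here is a simplified base-R imitation on made-up labels that use the class counts reported in the bar chart. It is only a sketch of the idea (sampling within each class), not the implementation of `createDataPartition`, and the names `y` and `idx_toy` are hypothetical.

```r
set.seed(2025)
# Hypothetical labels with the class sizes reported in the dashboard
y <- factor(rep(c("None", "Insomnia", "Sleep Apnea"), times = c(219, 77, 78)))

# Sample about 70% of the row indices within each class
idx_toy <- unlist(lapply(split(seq_along(y), y), function(i) {
  sample(i, size = ceiling(0.7 * length(i)))
}))

table(y[idx_toy])    # training counts per class
table(y[-idx_toy])   # testing counts per class
```

Sampling within each class keeps the class proportions nearly the same in the training and testing sets, which matters here because the two disorder classes are much smaller than the none class.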

Then, we fit a multinomial logistic regression predicting sleep_disorder from every other variable in the data except person_id using the training data.

```{r split_data, fig.height=5}
pacman::p_load(caret, nnet, pROC, pscl, MASS, broom)
df <- sleep[,-c(1)] # remove person_id
set.seed(2025)
idx <- createDataPartition(df$sleep_disorder, p = 0.7, list = FALSE)
train <- df[idx,]
test <- df[-idx,]

train_tab <- as.data.frame(table(train$sleep_disorder))
test_tab  <- as.data.frame(table(test$sleep_disorder))
train_tab$Dataset <- "Train"
test_tab$Dataset  <- "Test"
colnames(train_tab) <- c("Sleep Disorder", "Count","Dataset")
colnames(test_tab)  <- c("Sleep Disorder", "Count","Dataset")

datatable(
  rbind(train_tab,test_tab),
  caption = "Sleep Disorder Distribution in Train and Test Sets",
  class = "cell-border stripe"
)
```

```{r}
model_all <- multinom(sleep_disorder ~ ., data= train, trace=F)
```

### Goodness of Fit

#### Goodness of Fit

The full model fits significantly better than the null model: the p-value for the likelihood ratio test is practically 0. The McFadden R-squared value is extremely high at 0.730. Values between 0.2 and 0.4 are generally considered an excellent fit, so this indicates that the predictors explain a large portion of the variation in sleep disorder classification.
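
For reference, the two quantities reported in this tab are usually defined as follows, writing $\ell_0$ for the log-likelihood of the null (intercept-only) model and $\ell_1$ for that of the full model. The symbols are notation introduced here, not output from the code.

$$
G^2 = -2\,(\ell_0 - \ell_1), \qquad R^2_{\text{McFadden}} = 1 - \frac{\ell_1}{\ell_0}
$$

$G^2$ is compared with a chi-squared distribution whose degrees of freedom equal the difference in the number of estimated parameters between the two models.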

```{r good_of_fit}
null_model <- multinom(sleep_disorder ~ 1, data = train, trace = F)

LL_full <- logLik(model_all); LL_null <- logLik(null_model)
G2 <- -2 * (as.numeric(LL_null) - as.numeric(LL_full))
df_1 <- attr(LL_full, "df") - attr(LL_null, "df")
p_value <- pchisq(G2, df = df_1, lower.tail = FALSE)

tibble(
  Test = "Likelihood Ratio (Null vs Full)",
  G2 = G2, df = df_1, `p_value` = p_value
)

pR2(model_all)["McFadden"] # print the McFadden pseudo R-squared
```

### Interpret Coefficients

#### Log-Odds

A positive coefficient shows that the predictor increases the log-odds of being in that category versus the reference category. For example, an increase in age leads to an increase in the log-odds of a person having sleep apnea. A negative coefficient does the opposite: genderFemale has a negative coefficient for insomnia, which means that being female decreases the log-odds of having insomnia.

```{r log_odds}
tidy(model_all)
```

### Performance

#### Confusion Matrix

The model classifies all three sleep disorder categories with high accuracy. For the None class, it correctly identifies 64 individuals, misclassifying only 3 as insomnia and none as sleep apnea. For insomnia, it correctly predicts 19 cases, with only 1 being mistaken for sleep apnea. For Sleep Apnea, it correctly identifies 22 individuals, with just 2 total misclassifications. Overall, the model shows strong performance with very few errors, and is particularly good at distinguishing between no sleep disorder and sleep apnea.

```{r cm}
pred_class <- predict(model_all, newdata = test)
pred_prob  <- predict(model_all, newdata = test, type = "prob")

cm <- confusionMatrix(
  data = factor(pred_class, levels = levels(test$sleep_disorder)),
  reference = test$sleep_disorder
)



cm_df <- as.data.frame(cm$table)
colnames(cm_df) <- c("Predicted", "Actual", "Freq")

ggplot(cm_df, aes(Actual, Predicted, fill = Freq)) +
  geom_tile(color = "black") +
  geom_text(aes(label = Freq), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "white", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap for Model 1",
       x = "Actual Class",
       y = "Predicted Class") +
  theme_minimal()
```

#### Model Metrics

The model shows a high accuracy of 0.9189 and a strong Kappa value of 0.8589, indicating excellent agreement between predicted and actual sleep disorder classifications beyond chance. The 95% confidence interval for accuracy, 0.8517–0.9623, suggests that accuracy on new data from the same population is likely to stay high. The McNemar test p-value of 0.0719 shows no strong evidence of prediction bias between specific classes.

The model performs very strongly across all three classes, with each showing high sensitivity, specificity, and balanced accuracy. The None class is predicted most accurately (F1 = 0.97, balanced accuracy = 0.96), indicating excellent identification of individuals without a sleep disorder. Insomnia shows a slightly lower sensitivity of 0.83 but a very high precision of 0.95, meaning the model rarely mislabels other classes as insomnia. Sleep Apnea has strong performance overall (F1 = 0.94, balanced accuracy = 0.97), reflecting reliable detection and very low false-positive and false-negative rates. Overall, the model distinguishes all three classes with high consistency and minimal misclassification.
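
To make the per-class metrics concrete, the sketch below computes them by hand from a made-up 3x3 confusion matrix. The counts in `cm_toy` are hypothetical and are not the model's results; rows are predictions and columns are actual classes, matching the layout of the caret table.

```r
# Hypothetical 3x3 confusion matrix: rows = predicted, columns = actual
classes <- c("None", "Insomnia", "Sleep Apnea")
cm_toy <- matrix(
  c(50,  3,  1,
     2, 18,  1,
     1,  2, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

tp <- diag(cm_toy)                 # correct predictions per class
fp <- rowSums(cm_toy) - tp         # predicted as the class but actually another
fn <- colSums(cm_toy) - tp         # actually the class but predicted as another
tn <- sum(cm_toy) - tp - fp - fn   # everything else

accuracy          <- sum(tp) / sum(cm_toy)
sensitivity       <- tp / (tp + fn)   # also called recall
specificity       <- tn / (tn + fp)
precision         <- tp / (tp + fp)
f1                <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

round(rbind(sensitivity, specificity, precision, f1, balanced_accuracy), 3)
```

Each class is treated one-versus-rest, so a class can have high precision and lower sensitivity at the same time, as insomnia does in the paragraph above.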

```{r metrics}
kable(round(cm$overall,4))
kable(round(cm$byClass,4))
```


Column {.tabset data-width=500}
---

### Second Model

#### Select Variables

I wanted to select variables that were significant for both insomnia and sleep apnea in the previous model. I also wanted to make the model simpler to see how it compares to a more complex model. In the second model we predict sleep_disorder from sleep_duration, quality_of_sleep, stress_level, bmi_category, heart_rate, and daily_steps. The table shows the p-values for each predictor variable in the first model.
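
The p-values in that table are two-sided Wald tests, computed from each coefficient estimate $\hat\beta$ and its standard error. In the notation used here (introduced for this note), $\Phi$ is the standard normal CDF, which is `pnorm` in R.

$$
z = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)}, \qquad p = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$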

```{r select_variables}
# Wald z-statistics and two-sided p-values for the first model
s <- summary(model_all)
z <- s$coefficients / s$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
datatable(round(p, 4), caption = "P-Values for Model 1")

# Reduced model with six predictors
model2 <- multinom(sleep_disorder ~ sleep_duration + quality_of_sleep + stress_level + bmi_category + heart_rate + daily_steps, data = train, trace = FALSE)
```

### Goodness of Fit

#### Goodness of Fit

The McFadden value for the new model is 0.545. This is still a very strong McFadden value. It indicates that the predictors in the model explain a substantial amount of the variation in sleep disorders.

From the P-value table, most predictors have significant p-values for both insomnia and sleep apnea. The only non-significant predictor was sleep_duration for insomnia. This could be due to the large spread of sleep_duration in the insomnia class, making it not very useful as a predictor. 

```{r gof2}
pR2(model2)["McFadden"]
s <- summary(model2)
z <- s$coefficients / s$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
datatable(round(p, 4), caption = "P-Values for Model 2")
```

### Interpret Coefficients

#### Odds-Ratios

The odds ratios tell us how a one-unit change in a predictor affects the odds of being in a certain group rather than the reference group. In this model, bmi_category, specifically the Obese category, had a large impact on prediction. Being obese raises the odds of both insomnia (158,565) and sleep apnea (75,369) compared to someone with a normal BMI. Odds ratios this extreme should be read as "very large" rather than as precise multipliers, since no obese person in the data is free of a sleep disorder. Each one-unit increase in heart_rate lowered the odds of insomnia (0.775) and raised the odds of sleep apnea (1.339).
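
As a small worked example of how odds ratios scale, the sketch below starts from the 1.339 odds ratio reported above for heart_rate and sleep apnea. The 5-unit increase is a hypothetical value chosen only for illustration.

```r
# Odds ratio reported above for heart_rate (sleep apnea vs. none)
or_heart_rate <- 1.339

# The corresponding coefficient on the log-odds scale
beta <- log(or_heart_rate)

# Odds ratios multiply: a k-unit increase scales the odds by exp(k * beta)
k <- 5   # hypothetical 5-unit increase, for illustration only
or_k_units <- exp(k * beta)

# Percent change in the odds for a one-unit increase
pct_change <- (or_heart_rate - 1) * 100
```

Because the model is linear on the log-odds scale, effects add there and multiply on the odds scale, which is why `or_k_units` equals `or_heart_rate^k`.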

```{r or}
kable(exp(coef(model2)))
```

### Performance

#### Confusion Matrix

The second model shows strong overall performance in predicting sleep disorder categories. For no sleep disorder, it correctly predicts 61 individuals, misclassifying only 4 cases, 3 as Insomnia and 1 as Sleep Apnea. For insomnia, it correctly identifies 19 cases, with just 2 misclassified as no sleep disorder. For sleep apnea, the model accurately predicts 22 individuals, with only 3 misclassifications, 2 as no sleep disorder and 1 as insomnia. These results indicate that the model performs well across all classes.

```{r cm2}
pred_class2 <- predict(model2, newdata = test)
pred_prob2  <- predict(model2, newdata = test, type = "prob")

cm2 <- confusionMatrix(
  data = factor(pred_class2, levels = levels(test$sleep_disorder)),
  reference = test$sleep_disorder
)

cm_df2 <- as.data.frame(cm2$table)
colnames(cm_df2) <- c("Predicted", "Actual", "Freq")

ggplot(cm_df2, aes(Actual, Predicted, fill = Freq)) +
  geom_tile(color = "black") +
  geom_text(aes(label = Freq), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "white", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap Model 2",
       x = "Actual Class",
       y = "Predicted Class") +
  theme_minimal()
```

#### Model Metrics

The model achieves high accuracy (0.9189) with a strong Kappa score (0.8581), showing that its predictions align very well with the true sleep disorder categories beyond what would occur by chance. The 95% confidence interval for accuracy, 0.8517–0.9623, indicates stable performance. The McNemar p-value of 0.6746 suggests that the model's mistakes are not biased toward any particular class.

The model performs well across all three categories. For individuals without a sleep disorder, sensitivity and precision are both high, 0.94, meaning the model correctly identifies most of these cases and rarely mislabels others as having no disorder. Its balanced accuracy of 0.93 reflects strong overall performance. For insomnia, sensitivity is lower, 0.83, but precision is high, 0.90, and the balanced accuracy is 0.90. For sleep apnea, performance is strongest, with very high sensitivity of 0.96, strong precision, 0.88, and a high F1 score of 0.92. The balanced accuracy, 0.96, shows the model identifies sleep apnea very reliably. Overall, the model performs best on sleep apnea and none.

```{r mm2}
kable(round(cm2$overall,4))
kable(round(cm2$byClass,4))
```

Discussion
===

Column {.tabset data-width=500}
---

### Results

#### Best Model

The two models perform nearly identically. Their accuracy, accuracy confidence interval, and accuracy p-value are exactly the same, and their kappa values are nearly identical (0.8589 vs 0.8581). They have slightly different McNemar p-values, but neither is statistically significant. The first model performs slightly better at predicting none and insomnia (comparing balanced accuracy), and the second model performs better at predicting sleep apnea. The first model has a much higher McFadden R-squared value than the second (0.730 vs 0.545). The two models perform very similarly, yet one is complex with 11 predictors and the other has just 6. Both models are very accurate, so there is no wrong answer in selecting one. For the sake of this report, I will select the first model as the better of the two. It has a higher pseudo R-squared value and performs better on two of the three classes in sleep_disorder.

#### Insomnia vs None

Several predictors show strong associations with the odds of being classified with insomnia rather than having no sleep disorder. Being female increases the odds of insomnia by more than threefold, and higher stress levels also raise the odds (1.47), suggesting that emotional or psychological strain is an important factor. In contrast, both sleep duration and sleep quality are strongly protective: longer sleep and higher sleep quality sharply reduce the odds of insomnia, with very small odds ratios indicating that poor sleep habits are central to insomnia risk. BMI also plays a major role. Being obese is associated with extremely high odds of insomnia relative to a healthy BMI, making it one of the most influential predictors in the model. Physiological measures point in the same direction: higher heart rate and lower physical activity both increase the likelihood of insomnia. The strongest signals for insomnia involve lifestyle quality (sleep duration, sleep quality, stress) and health status (BMI, heart rate), indicating that both behavioral and physiological factors contribute substantially to insomnia risk.

#### Sleep Apnea vs None

A different pattern emerges when comparing sleep apnea to none. Being female dramatically lowers the odds of sleep apnea (OR near zero), suggesting a strong gender effect favoring males. Older age increases the odds (1.53), consistent with known clinical patterns. Several occupations, such as Doctor, Engineer, Scientist, and Sales Representative, show extremely high odds ratios, indicating much higher likelihood of sleep apnea relative to the baseline occupation, though such large values may partly reflect small sample sizes. Among lifestyle factors, sleep duration stands out most strongly: each additional hour of sleep increases the odds of sleep apnea more than fiftyfold (57), highlighting a meaningful link between long sleep duration and apnea. Sleep quality also sharply increases the odds (80). Higher heart rate is strongly associated with sleep apnea (32), and elevated diastolic blood pressure also increases the odds. BMI again plays a major role, with obesity showing extremely high odds for sleep apnea. The biggest predictors of sleep apnea are gender, age, BMI, sleep duration, and cardiovascular health, with clear patterns separating sleep apnea from no sleep disorder.

### Assumptions

#### Check Assumptions

Before interpreting the multinomial logistic regression model, I checked two important assumptions. The first is independence of observations: each observation in the data is independent of every other observation. The second is the absence of severe multicollinearity, which can be indicated by high correlation between predictors or checked with VIF values. There seems to be high multicollinearity between bp_systolic and bp_diastolic since they are both derived from blood pressure. This will not affect the model's accuracy, but I need to be careful about interpreting their coefficients and odds ratios since their standard errors may be inflated. I had to check multicollinearity using a linear model since the vif function in the car package is not compatible with nnet models. However, VIF measures how strongly each predictor is related to the other predictors and does not involve the response, so the values from the linear model are still informative.
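
The sketch below shows why the response does not enter a VIF calculation. It uses simulated predictors (not the sleep data) and computes each VIF by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others. The names `sim` and `vif_manual` are hypothetical.

```r
set.seed(1)                       # simulated data, not the sleep dataset
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)     # deliberately correlated with x1
x3 <- rnorm(n)                    # unrelated predictor
sim <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(sim), function(v) {
  fit <- lm(reformulate(setdiff(names(sim), v), response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})
vif_manual
```

In this simulation the two correlated predictors get large values and the unrelated one stays near 1, which is the same pattern to look for with bp_systolic and bp_diastolic. For factor predictors the car package reports a generalized VIF instead, so this hand calculation applies only to numeric columns.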

```{r VIF, warning=FALSE}
library(car)
vif(lm(as.numeric(sleep_disorder)~.,data=df))
```


Column {data-width=500}
---

### Discussion and Limitations

#### Discussion and Limitations

Throughout this project I have learned a great deal. I gained familiarity with many R packages, including tidyverse, nnet, caret, and DT. I also learned how to properly create and interpret a multinomial logistic regression model. One of the larger limitations is the small sample size of the data. I would have liked to have more observations so that the testing data could be larger. Different predictors could have affected the results, although the model already performed well with the available variables. In the future, I would like to try other prediction methods such as a neural network or random forest classifier.

### References

#### Sources

  1. Sleep Disorder Diagnosis Dataset. Kaggle, https://www.kaggle.com/datasets/mdsultanulislamovi/sleep-disorder-diagnosis-dataset/data
  2. Chen, Tessa. Multinomial Logistic Regression (MLR): Concepts & Application.
  3. Kuhn, Max, and Jed Wing. The caret Package. topepo.github.io/caret/index.html.
  4. “flexdashboard Manual.” RStudio R-Universe, https://rstudio.r-universe.dev/flexdashboard/doc/manual.html

#### AI Usage

AI was used in this project to help create color themes, format the dashboard, and debug some code.



About Me
===

Column {.tabset data-width=700}
---

### Who Am I?

#### Bio 

Hello, my name is Greg Pologruto. I am an Honors Computer Science student at the University of Dayton with minors in Data Analytics and Mathematics. I am on track to graduate in May of 2027.

I have experience working for FedEx as an information technology intern. During my time there, I worked with big data (millions of packages) to create a dashboard that included many key performance indicators that were critical to the business and to my agile team.

I have skills in Python, R, and Power BI. I have relevant projects in data visualization, data science, and machine learning. My goal is to acquire a data science internship in the technology field this summer.


#### Contact Me

pologrutog1@udayton.edu | [LinkedIn](https://www.linkedin.com/in/greg-pologruto) | [GitHub](https://github.com/greg-pologruto)
  

Column {data-width=300}
---

### Picture of Me
```{r picture, echo = FALSE, fig.cap = "Me (left) and my roommate Logan showing our Christmas spirit.", out.width = '100%'}
knitr::include_graphics("C:/Users/gregp/OneDrive/Pictures/Archive1/af9ae276-2198-4d1c-92d0-d3a65a3ca96d.jpg")
```