Please find the requirements in the attached document.
Need this assignment by Tuesday 9/20, 10:00 PM.
No copy and paste; plagiarism results in course termination.
Due Date/Time: 9/23/2021, 11:59 PM
Total Points: 100
You will implement the K-means clustering and Fuzzy C-means
clustering from scratch using a programming language of your choice.
Follow software design principles and document (comment) your code
clearly, explaining what you did and why you did it. In your
report, include a README that states how your code is supposed to be
run to obtain the expected results.
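As a starting point for the Fuzzy C-means part, the standard alternating update (weighted centroids, then memberships) might be sketched as follows. This is only a minimal sketch, not the required solution: the function name `fuzzy_c_means`, the fuzzifier `m = 2.0`, the tolerance, and the random initialisation are all assumptions, and Euclidean distance is used throughout.

```python
import numpy as np

def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Sketch of Fuzzy C-means; returns (centroids, membership matrix).

    points: (n, d) float array; c: number of clusters;
    m: fuzzifier (> 1), assumed to be 2.0 here.
    """
    rng = np.random.default_rng(seed)
    # Assumption: start from a random membership matrix whose rows sum to 1.
    u = rng.random((len(points), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Centroids are means of the points weighted by membership^m.
        w = u ** m
        centroids = (w.T @ points) / w.sum(axis=0)[:, None]
        # Euclidean distance from every point to every centroid.
        diff = points[:, None, :] - centroids[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        # Membership update: u_ij proportional to d_ij^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centroids, u
```

Taking `u.argmax(axis=1)` gives a hard label per patient if one is needed for plotting.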
You will use a dataset representing ten years of clinical care at 130 US
hospitals and integrated delivery networks. It includes over 50 features
representing patient and hospital outcomes. The dataset is included in
the assignment with the filename diabetic_data.csv.
Use the Euclidean distance to compute the distance between any two
patients in the dataset. You will run your clustering algorithms with
different combinations of variables as specified in each question.
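The Euclidean distance and the basic K-means loop described above could look roughly like the sketch below. The names `euclidean_distances` and `kmeans`, the random choice of initial centroids, and the stopping rule are illustrative assumptions, not something the assignment prescribes.

```python
import numpy as np

def euclidean_distances(points, centroids):
    """Return an (n_points, n_centroids) matrix of Euclidean distances."""
    diff = points[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of K-means (Lloyd's iteration); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids with k distinct rows of the data.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = euclidean_distances(points, centroids).argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = euclidean_distances(points, centroids).argmin(axis=1)
    return centroids, labels
```

Here `points` would be a float array with one row per patient and one column per selected variable. Because the result depends on the initial centroids, it may be worth trying several values of `seed`.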
1. K-means clustering with different numbers of clusters (30 points)
a. Run K-means on the entire dataset with the following two variables:
‘time_in_hospital’ and ‘num_medications’, with the number of clusters
K = 2. Plot your clusters using a 3D scatter plot and report (print) the
centroid locations. Based on this plot, what are your thoughts on the
b. Test with different numbers of clusters K, running from K = 2 to K = 10
using the same variables as in 1a. According to the scatter plots, which
number of clusters do you think is the most appropriate? Justify your
c. Implement Dunn index (DI) cluster validity measure from scratch.
Repeat the experiments in problem 1b and compute the corresponding
Which one do you believe is the best number of clusters according to
Dunn indices? Does this agree with your initial observation in problem
2. K-means clustering with different variables and sample size (30
a. Based on the best number of clusters you obtained in problems 1c and
the two variables, does adding the ‘insulin’ variable (total 3 variables)
improve clustering results for any 30 patients randomly selected? Use
scatter plots or any other equivalent method to justify your response.
b. Based on the model in problem 2a, does adding the ‘diabetesMed’
and ‘change’ variables (total five variables) improve the clustering
results for the same 30 patients? Plot the results and compute the Dunn
index to justify your response.
c. Randomly sample 50,000 observations and 10,000 observations from
the entire dataset and re-run 2a and 2b for each sample size. Plot the
clustering results and compute the Dunn index for each sample size and
compare the results with 50,000 and 10,000 observations vs the entire
dataset. Justify what you observe.
d. (Bonus): What happens to the relative positioning of the centroids as
you sample fewer observations
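For the Dunn index requested in problem 1c and reused in problems 2b and 2c, one common definition is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. Several variants of the index exist, so the definition below is an assumption, as is the function name `dunn_index`.

```python
import numpy as np

def dunn_index(points, labels):
    """Sketch of the Dunn index: min inter-cluster separation divided by
    max intra-cluster diameter, both measured with Euclidean distance.

    Assumes at least two clusters and at least one cluster with two
    distinct points (otherwise the diameter is zero).
    """
    clusters = [points[labels == c] for c in np.unique(labels)]

    def pairwise(a, b):
        # (len(a), len(b)) matrix of Euclidean distances.
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))

    # Diameter: largest distance between two points of the same cluster.
    max_diameter = max(pairwise(c, c).max() for c in clusters)
    # Separation: smallest distance between points of different clusters.
    min_separation = min(
        pairwise(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter
```

A larger value indicates compact, well-separated clusters. Note that this direct version builds full pairwise distance matrices, so memory grows with the square of the cluster size; on the entire dataset the distances would likely need to be computed in chunks.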