Please find the requirement in the attached document.

Need this assignment by Tuesday 9/20, 10:00 PM.

No copy and paste; plagiarism results in course termination.

Assignment 1

Due Date/Time: 9/23/2021, 11:59 PM

Total Points: 100

You will implement the K-means clustering and Fuzzy C-means

clustering from scratch using a programming language of your choice.

Follow software design principles and document (comment) your code

clearly explaining what you did and why you did what you did. In your

report, include a README that states how your code is supposed to be

run to obtain the expected results.

You will use a dataset representing ten years of clinical care at 130 US

hospitals and integrated delivery networks. It includes over 50 features

representing patient and hospital outcomes. The dataset is included in

the assignment with the filename diabetic_data.csv.
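
As a starting point, the columns named in each question can be pulled out of the CSV file with the standard library alone. The sketch below is a minimal illustration, not a required design: the helper name load_columns is hypothetical, the in-memory sample and its values are made up, and it assumes the chosen columns hold numeric values.

```python
import csv
import io


def load_columns(csv_file, columns):
    """Read the named columns from an open CSV file as tuples of floats."""
    reader = csv.DictReader(csv_file)
    rows = []
    for record in reader:
        # Each patient becomes one point: a tuple of numeric feature values.
        rows.append(tuple(float(record[name]) for name in columns))
    return rows


# Tiny in-memory stand-in for diabetic_data.csv (the values are made up).
sample = io.StringIO(
    "patient,time_in_hospital,num_medications\n"
    "a,3,12\n"
    "b,7,25\n"
)
points = load_columns(sample, ["time_in_hospital", "num_medications"])
print(points)
```

For the real data, the same function can be given the file object from open("diabetic_data.csv", newline=""). Non-numeric columns would need to be encoded as numbers first, and the choice of encoding should be explained in the report.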

Use the Euclidean distance to compute the distance between any two

patients in the dataset. You will run your clustering algorithms with

different combinations of variables as specified in each question.
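
The Euclidean distance and the basic K-means loop built on it can be sketched as follows. This is one common formulation (Lloyd's iteration with centroids initialized from randomly chosen data points), offered as an assumption about what "from scratch" might look like rather than the expected solution. The function names, the max_iter and seed parameters, and the toy points are all illustrative.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k data points drawn at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, 2)
print(centroids)
print(labels)
```

Fixing the seed makes runs repeatable, which helps when the README has to describe how to reproduce the reported centroids.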

1. K-means clustering with different numbers of clusters (30 points)

a. Run K-means on the entire dataset with the following two variables:

‘time_in_hospital’ and ‘num_medications’, with the number of clusters

K = 2. Plot your clusters using a 3D scatter plot and report (print) the

centroid locations. Based on this plot, what are your thoughts on the

generated clusters?

b. Test with different numbers of clusters K, running from K = 2 to K = 10

using the same variables in 1a. According to the scatter plots, which

number of clusters do you think is the most appropriate? Justify your

response.

c. Implement the Dunn index (DI) cluster validity measure from scratch.

Repeat the experiments in problem 1b and compute the corresponding

DI values.

Which one do you believe is the best number of clusters according to

Dunn indices? Does this agree with your initial observation in problem

1b?
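
For question 1c, a minimal sketch of the Dunn index is shown below. It assumes the classic definition: the smallest distance between two points in different clusters, divided by the largest distance between two points in the same cluster. Other variants (for example, centroid-based distances) exist, so the definition used should be stated in the report. The function names and toy data are illustrative.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dunn_index(points, labels):
    """Minimum inter-cluster distance over maximum cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("the Dunn index needs at least two clusters")
    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (euclidean(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between two points in different clusters.
    min_separation = min(
        euclidean(p, q)
        for a, b in combinations(groups, 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single location
    return min_separation / max_diameter


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = dunn_index(points, [0, 1, 0, 1])   # stretched, overlapping clusters
print(good, bad)
```

A higher value indicates more compact and better separated clusters. This version compares every pair of points, so its cost grows quadratically with the number of rows; on tens of thousands of observations it may need optimization or a cheaper variant.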

2. K-means clustering with different variables and sample size (30

points)

a. Based on the best number of clusters you obtained in problem 1c and

the two variables, does adding the ‘insulin’ variable (total 3 variables)

improve clustering results for any 30 patients randomly selected? Use

scatter plots or any other equivalent method to justify your response.

b. Based on the model in problem 2a, does adding the ‘diabetesMed’

and ‘change’ variables (total five variables) improve the clustering

results for the same 30 patients? Plot the results and compute the Dunn

index to justify your response.

c. Randomly sample 50,000 observations and 10,000 observations from

the entire dataset and re-run 2a and 2b for each sample size. Plot the

clustering results and compute the Dunn index for each sample size and

compare the results with 50,000 and 10,000 observations vs the entire

dataset. Justify what you observe.

d. (Bonus): What happens to the relative positioning of the centroids as

you sample fewer observations?
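
Questions 2a to 2c depend on random subsets, and 2b reuses the same 30 patients as 2a, so the sampling should be reproducible. A minimal sketch, assuming a fixed seed is an acceptable way to achieve that; the helper name sample_rows, the seed value, and the synthetic rows are illustrative.

```python
import random


def sample_rows(rows, n, seed=0):
    """Draw n distinct rows without replacement, repeatably for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Synthetic stand-in for the loaded dataset (the values are made up).
rows = [(float(i), float(i % 7)) for i in range(1000)]

subset_30 = sample_rows(rows, 30, seed=42)
same_again = sample_rows(rows, 30, seed=42)
print(subset_30 == same_again)
```

Calling the helper with the same seed returns the same patients, so the three-variable and five-variable runs can be compared on identical data. Sampling row indices instead of rows would work equally well and makes it easy to attach extra columns later.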