Survival Analysis using Cox Regression on Breast Cancer Data
In this report, survival data from a German breast cancer study have been analysed using the statistical programming language R. For the 686 female patients participating in the study, the values of eight explanatory variables were recorded at the start of the study. These variables were age, menopause status, whether or not the patient received tamoxifen, tumor grade, tumor size, number of positive lymph nodes, and the amounts of progesterone and estrogen bound to proteins in the cytosol of the primary tumor. Both time to recurrence of tumor and time to death were recorded for each patient. The focus has been to find out how important the explanatory variables are for time to recurrence and time to death. After creating Kaplan-Meier curves and performing log-rank tests, the data were analysed using Cox regression. The method of purposeful selection was used to choose which explanatory variables to include in the Cox regression model. Schoenfeld residual plots were used to check whether or not the assumption of proportional hazards was satisfied. Martingale residual plots were used to identify the functional form that should be used for the explanatory variables in the models. After purposeful selection, the variables that remained for time to death were size, grade, nodes and progesterone. For time to recurrence, the remaining variables were tamoxifen, grade, nodes and progesterone. An attempt was made to include recurrence as a time-dependent variable in the model for time to death, and it was found that patients who experience recurrence have a much higher risk of death than those who do not. Weibull-distributed survival times were simulated by assuming values for three explanatory variables (one normally, one exponentially and one uniformly distributed) and their associated regression coefficients. 
A data frame of the simulated survival data was created, and Cox regression was run on this data frame to check whether the assumed regression coefficients were reproduced. The 95 % confidence intervals for the regression coefficients produced by the Cox regression were found to include the assumed regression coefficient values. It was found that increasing the standard deviation of the normally distributed explanatory variable increased the accuracy of the regression coefficient estimates. Increasing the number of simulations was also found to increase the accuracy of the estimates. Survival data with non-proportional hazards were simulated using the R function sim.survdata. These data were used to test whether or not the Schoenfeld residual plot could detect the assumed functional form of a time-dependent regression coefficient. From the plot it was possible to see that the assumed functional form was a parabola.
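As a rough illustration of the simulation step described in the abstract, the sketch below draws Weibull survival times under proportional hazards by inverting the survival function S(t | x) = exp(-lam * t^k * exp(beta . x)) at a uniform random number. It is not the thesis code, which was written in R. The function names, the scale lam, the shape k, the coefficients beta and the covariate distribution parameters are all illustrative assumptions, and censoring is left out.

```python
import math
import random


def weibull_ph_time(u, x, beta, lam=1.0, k=1.5):
    """Solve S(t | x) = u for t, where S(t | x) = exp(-lam * t**k * exp(beta . x)).

    lam (scale) and k (shape) are assumed values, not taken from the thesis.
    """
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return (-math.log(u) / (lam * math.exp(linear_predictor))) ** (1.0 / k)


def simulate(n, beta, lam=1.0, k=1.5, seed=1):
    """Draw n covariate vectors and matching survival times (no censoring)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (
            rng.gauss(0.0, 1.0),    # normally distributed covariate
            rng.expovariate(1.0),   # exponentially distributed covariate
            rng.uniform(0.0, 1.0),  # uniformly distributed covariate
        )
        u = 1.0 - rng.random()      # uniform on (0, 1], so log(u) is defined
        rows.append((x, weibull_ph_time(u, x, beta, lam, k)))
    return rows


# Hypothetical regression coefficients, chosen only for illustration.
data = simulate(5, beta=(0.5, -0.3, 1.0))
for x, t in data:
    print([round(v, 3) for v in x], round(t, 3))
```

Fitting a Cox model to a larger sample generated this way and comparing the estimated coefficients with beta would correspond to the reproduction check described above.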
Master's thesis in Mathematics