---
title: "Double/Debiased ML Model"
author: "Cheney Yu"
date: "2022-11-06"
output: pdf_document
editor_options:
  chunk_output_type: inline
---

```{r setup, include=FALSE}
library(DoubleML)
```

We work with two examples: the Bonus data set from the Pennsylvania Reemployment Bonus experiment and, as a second example, data simulated from a partially linear regression model.

```{r}
# Simulate data from a sparse partially linear model
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
y = theta * d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
```

DoubleML provides interfaces to objects of class data.table as well as the R base classes data.frame and matrix. The DoubleMLData class serves as the data backend and can be initialized from a data frame by specifying the column y_col = "inuidur1" serving as outcome variable $Y$, the column(s) d_cols = "tg" serving as treatment variable $D$, and the columns x_cols = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd") specifying the confounders. Alternatively, a matrix interface can be used, as shown below for the simulated data.

```{r}
# Load the Bonus data; fetch_bonus() ships with the DoubleML package
df_bonus = fetch_bonus(return_type = "data.table")

# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
                                  y_col = "inuidur1",
                                  d_cols = "tg",
                                  x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                             "q2", "q3", "q4", "q5", "q6",
                                             "agelt35", "agegt54", "durable", "lusd", "husd"))
print(dml_data_bonus)
```

```{r}
# Matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
dml_data_sim
```

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate the nuisance functions $m_0$ and $g_0$. For the Bonus data we use a random forest regression model, and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of DoubleML is based on the mlr3 ecosystem for R.

```{r}
library(mlr3)
library(mlr3learners)

# Suppress messages from the mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")

# Random forest learners for the Bonus data
learner = lrn("regr.ranger", num.trees = 500, mtry = floor(sqrt(n_vars)),
              max.depth = 5, min.node.size = 2)
ml_g_bonus = learner$clone()
ml_m_bonus = learner$clone()

# Lasso learners for the simulated data
learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars) / n_obs))
ml_g_sim = learner$clone()
ml_m_sim = learner$clone()
```

When initializing the object for PLR models, DoubleMLPLR, we can further set parameters specifying the resampling: the number of folds used for cross-fitting, n_folds (defaults to n_folds = 5), as well as the number of repetitions when applying repeated cross-fitting, n_rep (defaults to n_rep = 1). Additionally, one can choose between the algorithms "dml1" and "dml2" via dml_procedure (defaults to "dml2"). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions.

```{r}
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_g = ml_g_bonus, ml_m = ml_m_bonus)
obj_dml_plr_bonus$fit()
obj_dml_plr_bonus$summary()
print(obj_dml_plr_bonus)
```

```{r}
obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_g = ml_g_sim, ml_m = ml_m_sim)
```

```{r}
obj_dml_plr_sim$fit()
obj_dml_plr_sim$summary()
print(obj_dml_plr_sim)
```

The DoubleML package, available since 2021, was developed by Philipp Bach, Victor Chernozhukov, Malte S. Kurz, and Martin Spindler.

## The 401(k) pension data from hdm

The hdm package has the data from the 1991 Survey of Income and Program Participation (SIPP) nicely wrapped up in the data set pension.
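Before mapping variables, it helps to take a quick look at the raw table. The short chunk below is an addition for orientation only; it assumes nothing beyond hdm being installed, and the column names shown (e401, net_tfa, age, inc, educ) come from the pension data's documentation.

```{r}
# Quick orientation (added): load the 401(k) data and peek at a few columns
library(hdm)
data(pension)
dim(pension)
head(pension[, c("e401", "net_tfa", "age", "inc", "educ")])
```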
Now we can map our response, treatment, and covariates from their description in the paper to their columns in the data.

- Response: net_tfa, the net total financial assets of a household; it can be positive or negative.
- Treatment: e401, an indicator for 401(k) eligibility.
- Covariates:
  - age: age
  - inc: income
  - fsize: family size
  - educ: years of education
  - pira: participation in an IRA
  - hown: home ownership status
  - marr: marriage status
  - db: defined benefit pension
  - twoearn: two-earner household

```{r}
library(hdm)
library(ggplot2)
library(tidyverse)

treatment <- c("e401")
response <- c("net_tfa")
xActual <- c("age", "inc", "fsize", "educ", "pira", "hown", "marr", "db", "twoearn")

data(pension)

ggplot(pension, aes(x = factor(e401))) +
  geom_bar() +
  coord_flip() +
  ggtitle("401k Eligibility")
```

```{r}
ggplot(pension, aes(x = net_tfa / 1e3)) +
  geom_histogram(binwidth = 50) +
  xlab("Net Financial Assets (thousands $)") +
  ylab("Count")
```

```{r}
ggplot(pension %>%
         select(age, inc, fsize, educ, pira, hown, marr, db, twoearn) %>%
         gather(Covariate, Value),
       aes(x = Value)) +
  geom_histogram() +
  facet_wrap(~Covariate, scales = "free")
```

Double machine learning is an attempt to estimate the effect a treatment has on a response without the estimate being unduly influenced by the covariates.

## Step 1: Split the data

```{r}
modelData <- pension %>%
  mutate(e401F = factor(e401, levels = c(0, 1)))

# Split the data into two halves for cross-fitting
set.seed(3141)  # for a reproducible split
inds <- sample.int(nrow(modelData), floor(nrow(modelData) / 2), replace = FALSE)
dataList <- list(modelData[inds, ], modelData[-inds, ])
```

## Step 2: Fit models for the response and the treatment

```{r}
library(caret)

train_control <- trainControl(method = "adaptive_cv",
                              number = 10,
                              search = "random",
                              verboseIter = TRUE)

# Model the response on everything except the treatment columns, once per half
rfResponseModel <- lapply(dataList,
                          function(x) train(net_tfa ~ . - e401 - e401F,
                                            method = "ranger",
                                            tuneLength = 10,
                                            data = x,
                                            verbose = TRUE,
                                            trControl = train_control))

# Model the treatment on everything except the response, once per half
rfTreatmentModel <- lapply(dataList,
                           function(x) train(e401F ~ . - net_tfa - e401,
                                             method = "ranger",
                                             tuneLength = 10,
                                             data = x,
                                             verbose = TRUE,
                                             trControl = train_control))
```

## Step 3: Residual-on-residual regression across the two halves

```{r}
calc_theta <- function(dataList, responseModel, treatmentModel) {
  # Predict the response in half 1 (2) using the model trained on half 2 (1)
  responsePredictions <- lapply(list(c(1, 2), c(2, 1)),
                                function(i) predict(responseModel[[i[1]]], dataList[[i[2]]]))

  # Do the same for the treatment model (factor levels mapped back to 0/1)
  treatmentPredictions <- lapply(list(c(1, 2), c(2, 1)),
                                 function(i) as.numeric(predict(treatmentModel[[i[1]]],
                                                                dataList[[i[2]]])) - 1)

  # Treatment residuals: observed treatment minus cross-fitted prediction
  treatmentResiduals <- list(dataList[[2]]$e401 - treatmentPredictions[[1]],
                             dataList[[1]]$e401 - treatmentPredictions[[2]])

  # Response residuals: observed response minus cross-fitted prediction
  responseResiduals <- list(dataList[[2]]$net_tfa - responsePredictions[[1]],
                            dataList[[1]]$net_tfa - responsePredictions[[2]])

  # Regress the residuals in each half (a ratio of two sample moments)
  theta1 <- mean(treatmentResiduals[[1]] * responseResiduals[[1]]) /
    mean(treatmentResiduals[[1]] * dataList[[2]]$e401)
  theta2 <- mean(treatmentResiduals[[2]] * responseResiduals[[2]]) /
    mean(treatmentResiduals[[2]] * dataList[[1]]$e401)

  # Take the average as our treatment effect estimator
  mean(c(theta1, theta2))
}

calc_theta(dataList, rfResponseModel, rfTreatmentModel)
```
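As a cross-check, the hand-rolled estimate above can be compared with the DoubleML package used in the first half of this document. The chunk below is an added sketch, not part of the original analysis: it assumes the DoubleML, mlr3, and mlr3learners packages attached earlier are still loaded, reuses the treatment, response, and xActual variables defined above, and builds the data backend with double_ml_data_from_data_frame(), the data.frame counterpart of the matrix helper used earlier.

```{r}
# Added sanity check: estimate the same effect with DoubleML's built-in
# cross-fitting and compare against calc_theta()'s output above.
dml_data_pension = double_ml_data_from_data_frame(pension,
                                                  y_col = response,
                                                  d_cols = treatment,
                                                  x_cols = xActual)

# Random forest nuisance learners, loosely mirroring the caret "ranger" models
ml_g_pension = lrn("regr.ranger", num.trees = 500)
ml_m_pension = lrn("regr.ranger", num.trees = 500)

obj_dml_plr_pension = DoubleMLPLR$new(dml_data_pension,
                                      ml_g = ml_g_pension,
                                      ml_m = ml_m_pension)
obj_dml_plr_pension$fit()
obj_dml_plr_pension$summary()
```

One difference worth noting: this fit conditions only on the confounders in xActual, whereas the caret models above use every remaining column of pension, so the two point estimates will not coincide exactly.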