# Simulating Fake Data in R

This blog post is on simulating fake data. I'm interested in creating synthetic versions of real datasets. For example if the data is too sensitive to be shared, or we only have summary statistics available (for example tables from a published research paper).

If we want to mimic an existing dataset, it is desirable to

- Make sure that the simulated variables have the proper data type and comparable distribution of values and
- correlations between the variables in the real dataset are taken into account.

In addition, it would be nice if such functionality is available in a standard R package. After reviewing several R packages that can simulate data, I picked the simstudy package as most promising to explore in more detail. `simstudy`

is created by **Keith Goldfeld** from New York University.

In this blog post, I explain how `simstudy`

is able to generate correlated variables, having either continuous or binary values. Along the way, we learn about fancy statistical slang such as copula's and tetrachoric correlations. It turns out there is a close connection with psychometrics, which we'll briefly discuss.

Let's start with correlated continuous variables.

```
# Loading required packages
library(simstudy)
library(data.table)
library(ggplot2)
```

# Simstudy in action

Now let's see how `simstudy`

helps us generating this type of simulated data. Simstudy works with "definition tables" that allow us to specify, for each variable, which distribution and parameters to use, as well as the desired correlations between the variables.

After specifing a definition table, we can call one of its workhorse functions `genCorFlex()`

to generate the data.

N.b. Simstudy uses different parameters for the Gamma distribution, compared to R's `rgamma()`

function. Under water, it uses the `gammaGetShapeRate()`

to transform the "mean" and "variance/ dispersion" to the more conventional "shape" and "rate" parameters.

```
set.seed(123)
corr <- 0.7
corr.mat <- matrix(c(1, corr,
corr, 1), nrow = 2)
# check that gamma parameters correspond to same shape and rate pars as used above
#simstudy::gammaGetShapeRate(mean = 4, dispersion = 0.25)
def <- defData(varname = "X1_GAM",
formula = 4, variance = 0.25, dist = "gamma")
def <- defData(def, varname = "X2_NORM",
formula = 10, variance = 2, dist = "normal")
dt <- genCorFlex(1e4, def, corMatrix = corr.mat)
cor(dt[,-"id"])
```

```
## X1_GAM X2_NORM
## X1_GAM 1.0000000 0.6823006
## X2_NORM 0.6823006 1.0000000
```

```
ggplot(dt, aes(x = X1_GAM)) +
geom_histogram(boundary = 0) +
geom_vline(xintercept = 4, col = "red")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

# Relation to psychometrics

So what has psychometrics to do with all this simulation of correlated binary vector stuff?

Well, psychometrics is all about theorizing about unobserved, latent, imaginary "constructs", such as **attitude**, **general intelligence** or a **personality trait**. To measure these constructs, questionnaires are used. The questions are called **items**.

Now imagine a situation where we are interested in a particular construct, say **general intelligence**, and we design two questions to measure (hope to learn more about) the construct. Furthermore, assume that one question is more difficult than the other question. The answers to both questions can either be wrong or right.

We can model this by assuming that the (imaginary) variable "intelligence" of each respondent is located on a two-dimensional plane, with the distribution of the respondents determined by a bivariate normal distribution. Dividing this plane into four quadrants then gives us the measurable answers (right or wrong) to both questions. Learning the answers to both questions then gives us an approximate location of a respondent on our "intelligence" plane!

## Phi, tetrachoric correlation and the psych package

Officially, the Pearson correlation between two binary vectors is called the Phi coefficient. This name was actually chosen by Karl Pearson himself.

The **psych** packages contains a set of convenient functions for calculating Phi coefficients from empirical two by two tables (of two binary vectors), and finding the corresponding Pearson coefficient for the 2d (latent) normal. This coefficient is called the **tetrachoric correlation**. Again a fine archaic slang word for again a basic concept.

`library(psych)`

```
##
## Attaching package: 'psych'
```

```
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
```

```
# convert simulated binary vectors B1 and B2 to 2x2 table
twobytwo <- table(df$B1, df$B2)/nrow(df)
phi(twobytwo, digits = 6)
```

`## [1] 0.099574`

`cor(df$B1, df$B2)`

`## [1] 0.09957392`

`# both give the same result`

We can use **phi2tetra** to find the tetrachoric correlation that corresponds to the combination of a "Phi coefficient", i.e. the correlation between the two binary vectors, as well as their marginals. This is a wrapper that builds the two by two frequency table and then calls `tetrachoric()`

. This in turn uses `optimize`

(Maximum Likelihood method?) to find the tetrachoric correlation.

`phi2tetra(0.1, c(0.2, 0.8))`

`## [1] 0.2217801`

```
# compare with EP method
simstudy:::.findRhoBin(0.2, 0.8, 0.1)
```

`## [1] 0.2218018`

Comparing with the Emrich and Piedmonte method, we find that they give identical answers. Great, case closed!

# Simstudy in action II

Now that we feel confident in our methods and assumptions, let's see `simstudy`

in action.

Let's generate two binary variables, that have marginals of 20% and 80% respectively, and a Pearson correlation coefficient of 0.1.

```
set.seed(123)
corr <- 0.1
corr.mat <- matrix(c(1, corr,
corr, 1), nrow = 2)
res <- simstudy::genCorGen(10000, nvars = 2,
params1 = c(0.2, 0.8),
corMatrix = corr.mat,
dist = "binary",
method = "ep", wide = TRUE)
# let's check the result
cor(res[, -c("id")])
```

```
## V1 V2
## V1 1.00000000 0.09682531
## V2 0.09682531 1.00000000
```

Awesome, it worked!

# Conclusion

Recall, my motivation for simulating fake data with particular variable types and correlation structure is to mimic real datasets.

So are we there yet? Well, we made some progress. We now can handle correlated continuous data, as well as correlated binary data.

But we need to solve two more problems:

To simulate a particular dataset, we still need to determine for each variable its data type (binary or continuous), and if it's continuous, what is the most appropriate probability distribution (Normal, Gamma, Log-normal, etc).

we haven't properly solved correlation between dissimilar data types, e.g. a correlation between a continuous and a binary variable.

Judging from the literature (Amatya & Demirtas 2016) and packages such as `SimMultiCorrData`

by Allison Fialkowski, these are both solved, and I only need to learn about them! So, to be continued.