TidyFinance: Financial data in R

class: center, middle, inverse, title-slide

.title[
# TidyFinance: Financial data in R
]
.subtitle[
## www.tidy-finance.org
]
.author[
### Patrick Weiss
]
.date[
### December 5, 2022
]

---

class: chapter-slide

# Workshop outline

---
## What will be covered in the workshop?

**1. Introduction**

**2. Publicly available financial data**

**3. WRDS: CRSP, Compustat, and TRACE**

**4. Data storage for research**

**5. Other provides**

---
## What will you learn today?

**Data**
- Yahoo!Finance and French data
- Stock data from CRSP
- Accounting data from Compustat
- Bond trade data from TRACE
- Other data providers

**Tools**
- Within R data collection
- WRDS connection
- Data storage in SQLite databases
---

### The basis for this workshop: #TidyFinance is ...

.pull-left[
- ... an open-source `{bookdown}` available at [tidy-finance.org](https://www.tidy-finance.org).

- ... a step towards **reproducible finance** by providing a fully transparent code base.

- ... a resource for students, lecturers, and professionals using `R` for applications in finance.

- ... a **tidy** approach to finance.

- ... continuously maintained and expanded.
]

.pull-right[
<middle><center><img src="data:image/png;base64,#images/cover.jpg" alt="Cover image of 'Tidy Finance with R'. The figure reads Tidy Finance with R followed by the author's names; Christoph Scheuch, Stefan Voigt, and Patrick Weiss." width="400"/></center></middle>
]

---
class: middle

.pull-left[
### `R` is among the best choices for finance programming.

- Free, open-source software with a diverse, active **community**.

- Actively-maintained **packages** for all kinds of applications.

- Smooth integration with other **programming languages**.

- **RStudio** <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:steelblue;overflow:visible;position:relative;"><path d="M225.8 468.2l-2.5-2.3L48.1 303.2C17.4 274.7 0 234.7 0 192.8v-3.3c0-70.4 50-130.8 119.2-144C158.6 37.9 198.9 47 231 69.6c9 6.4 17.4 13.8 25 22.3c4.2-4.8 8.7-9.2 13.5-13.3c3.7-3.2 7.5-6.2 11.5-9c0 0 0 0 0 0C313.1 47 353.4 37.9 392.8 45.4C462 58.6 512 119.1 512 189.5v3.3c0 41.9-17.4 81.9-48.1 110.4L288.7 465.9l-2.5 2.3c-8.2 7.6-19 11.9-30.2 11.9s-22-4.2-30.2-11.9zM239.1 145c-.4-.3-.7-.7-1-1.1l-17.8-20c0 0-.1-.1-.1-.1c0 0 0 0 0 0c-23.1-25.9-58-37.7-92-31.2C81.6 101.5 48 142.1 48 189.5v3.3c0 28.5 11.9 55.8 32.8 75.2L256 430.7 431.2 268c20.9-19.4 32.8-46.7 32.8-75.2v-3.3c0-47.3-33.6-88-80.1-96.9c-34-6.5-69 5.4-92 31.2c0 0 0 0-.1 .1s0 0-.1 .1l-17.8 20c-.3 .4-.7 .7-1 1.1c-4.5 4.5-10.6 7-16.9 7s-12.4-2.5-16.9-7z"/></svg>
]

.pull-right[
### `{tidyverse}` is the way for data analysis.

- Messy data cause pain. **Tidy data** cause joy.

- Compose simple functions with the **pipe**.

- Designed for **humans**.

- **Consistent** across packages.
]

---
## Today's speaker: Patrick Weiss

&nbsp;

.pull-left[
<middle><center><img style="border-radius: 50%;" alt="Portrait of Patrick Weiss" src="data:image/png;base64,#images/pic_patrick.jpg" width="200px"/></center></middle>

.center[**Patrick Weiss**

- External lecturer at Reykjavik University

- Research focus on empirical asset pricing with equities and bonds

- Published in JFE on 'The maturity premium'

- Co-author of TidyFinance
]

---
## Before we do anything else

- If you have not installation of `{tidyverse}`, run `install.packages("tidyverse")`

``` r
library(tidyverse)
```

---
class: chapter-slide

# Publicly available financial data

---
## Where to get free data?

- There are several free financial data providers:
  + Yahoo!Finance
  + Kenneth French's website
  + FED & ECB data
  
- We will discuss how to access them and manipulate the data

- There are more free alternative presented in Chapter 5 of #TidyFinance

---
## Yahoo!Finance overview

- We start by downloading and visualizing stock data from Yahoo!Finance
  + Provides stock data for several frequencies
  + Adjusted prices for dividends & splits
  + Returns have to be computed from the price time series

- For this, we also load the convenient `{tidyquant}` package

``` r
library(tidyquant)
```

---
## Apple stock data

- We retrieve stock market data for Apple (ticker AAPL) with `tq_get()`

``` r
prices <- tq_get("AAPL",
  get = "stock.prices",
  from = "2000-01-01",
  to = "2021-12-31"
)
```

- `{tidyquant}` comes with additional features, but we will skip them here (consider `tq_index("DOW")` to get a set of tickers)

---
## Apple stock price (I)

- It is generally a good idea to look at your data, e.g., by visualizing it in a plot:

``` r
prices |>
  ggplot(aes(x = date, y = adjusted)) +
  geom_line() +
  labs(
    x = NULL,
    y = NULL,
    title = "Apple stock prices between beginning of 2000 and end of 2021"
  )
```

---
## Apple stock price (II)

![Apple stock prices between beginning of 2000 and end of 2021](data:image/png;base64,#W4U_FD_slides_files/figure-html/unnamed-chunk-5-1.png)

---
## Computing Apple's stock returns

- We can now compute stock returns from our prices
  + Considering adjusted stock prices (dividends and stock splits)
  + Returns can be computed as `$r_t = p_t / p_{t-1} - 1$`
  + Where `$p_t$` is the adjusted day `$t$` price

``` r
returns <- prices |>
  arrange(date) |>
  mutate(ret = adjusted / lag(adjusted) - 1) |>
  select(symbol, date, ret) |> 
  drop_na(ret)
```

---
## Apple's stock returns: Historgram (I)

``` r
quantile_05 <- quantile(returns |> pull(ret) * 100, probs = 0.05)

returns |>
  ggplot(aes(x = ret * 100)) +
  geom_histogram(bins = 100) +
  geom_vline(aes(xintercept = quantile_05),
    linetype = "dashed"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Distribution of daily Apple stock returns in percent"
  )
```

---
## Apple's stock returns: Historgram (II)

![Distribution of daily Apple stock returns in percent](data:image/png;base64,#W4U_FD_slides_files/figure-html/unnamed-chunk-8-1.png)

---
## Kenneth French's website

- Kenneth French provides data on his webpage including
  + Risk-free rates
  + Risk factors returns
  + Test assets for various tests
  
- Access via the `frenchdata` package

- Check out all available data on the webpage or via `get_french_data_list()`

---
## Kenneth French's data access

``` r
library(frenchdata)
factors_ff_monthly_raw <- download_french_data("Fama/French 3 Factors")
factors_ff_monthly <- factors_ff_monthly_raw$subsets$data[[1]] |>
  transmute(
    month = floor_date(ymd(str_c(date, "01")), "month"),
    rf = as.numeric(RF) / 100,
    mkt_excess = as.numeric(`Mkt-RF`) / 100,
    smb = as.numeric(SMB) / 100,
    hml = as.numeric(HML) / 100)
```

---
## Risk-free rate

``` r
factors_ff_monthly |> ggplot(aes(x = month, y = 100 * rf)) + 
  geom_line() +  labs(x = NULL, y = NULL, title = "Risk free rate (in percent)") 
```

![Risk free rate (in percent)](data:image/png;base64,#W4U_FD_slides_files/figure-html/unnamed-chunk-10-1.png)

---
## Macroeconomic data sources

- There are tons of data available from the FED and ECB

- Check out the packages `{fredr}` and `{ecb}` for easy access

- Alternatively, Ivo Welch provides some important predictors for asset pricing research on his website
  + Use the packages `{readxl}` and `{googledrive}` to retrieve it
  + Manipulating the data as suggested in Chapter 2.3 of #TidyFinance

---
class: chapter-slide

# WRDS: CRSP, Compustat, and TRACE

---
background-image: url("data:image/png;base64,#images/wrds_logo.png")
background-position: 90% 60%
background-size: 400px

## Where to get data?

- Gathering high quality data is the basis for any for research project

- Luckily, the Wharton Research Data Service (i.e., WRDS) exists

- WRDS combines many data providers in one, easy-to-use platform
  + CRSP,
  + Compustat,
  + TRACE,
  + Optionmetrics, and so much more..
  
---
## How to retrive data from WRDS?

- You can access the WRDS-webpage and use their interface to download data

- WRDS supports many different data formats, if you prefer other software

- However, the easiest way is a remote connection via the `{RPostgres}` package
  + You could use the `{odbc}` package, but it comes without suitable drivers

``` r
library(RPostgres)
library(dbplyr)
```

- `{dbplyer}` gives us functionality to handle remote connections in a tidy manner

---
## Connecting to WRDS

``` r
wrds <- dbConnect(
  Postgres(),
  host = "wrds-pgdata.wharton.upenn.edu",
  dbname = "wrds",
  port = 9737,
  sslmode = "require",
  user = Sys.getenv("user"),
  password = Sys.getenv("password")
)
```

Note that user and password are defined as environment variables for me

---
## The US stock market

- Large parts of the academic literature focus on US stock markets

- Stocks are listed on US exchanges (NYSE, AMEX, NASDAQ, and some smaller ones)

- Extensive data on prices is provided by the Center for Research in Security Prices (CRSP), maintained by the University of Chicago, Booth School of Business

- Full sample ranges from December 1925 and is continuously updated

---
## The CRSP record

- Data processing for CRSP involves several steps:
  + Defining the sample to US stocks (i.e., `shrcd%in%c(10,11)`)
  + Add delisting returns

- The data contains
  + Stock identifier `permno`
  + Time identifier `month`
  + Price `altprc`, return `ret`, and shares outstanding `shrout`
  + listing exchanges `exchcd` and firm's industry `siccd`

---
## Monthly CRSP data: WRDS download

``` r
# Security returns
msf_db <- tbl(wrds, in_schema("crsp", "msf"))

# Identifying information
msenames_db <- tbl(wrds, in_schema("crsp", "msenames"))

# Delisting returns
msedelist_db <- tbl(wrds, in_schema("crsp", "msedelist"))
```

---
## Monthly CRSP data: Combining information

``` r
crsp_monthly <- msf_db |>
  filter(date >= start_date & date <= end_date) |>
  inner_join(msenames_db |>
               filter(shrcd %in% c(10, 11)) |>
               select(permno, exchcd, siccd, namedt, nameendt),
             by = c("permno")) |>
  filter(date >= namedt & date <= nameendt) |>
  mutate(month = floor_date(date, "month")) |>
  left_join(msedelist_db |>
              select(permno, dlstdt, dlret, dlstcd) |>
              mutate(month = floor_date(dlstdt, "month")),
            by = c("permno", "month"))
```

---
## Monthly CRSP data: Real download

``` r
crsp_monthly <- crsp_monthly |>
  select(
    permno, # Security identifier
    date, # Date of the observation
    month, # Month of the observation
    ret, # Return
    shrout, # Shares outstanding (in thousands)
    altprc, # Last traded price in a month
    exchcd, # Exchange code
    siccd, # Industry code
    dlret, # Delisting return
    dlstcd # Delisting code
  ) |>
  collect()
```

---
## Load data

- We connect to our existing database and load
  + Fama-French factors including the risk-free rate
  + CRSP data
  + CPI data, which we directly match to CRSP

``` r
crsp_monthly <- crsp_monthly |>
  mutate(month = ymd(month),
         shrout = shrout * 1000) |> 
  left_join(cpi_monthly, by = "month") |>
  left_join(factors_ff_monthly, by = "month") 
```

---
## The CRSP record overview: Exchanges

``` r
crsp_monthly |>  
  count(exchange, date) |> 
  ggplot(aes(x = date, y = n, color = exchange)) + geom_line() +
  labs(x = NULL, y = NULL, color = NULL, title = "Number of securities by exchange")
```

![Market cap by industry (in trillion USD)](data:image/png;base64,#W4U_FD_slides_files/figure-html/unnamed-chunk-19-1.png)

---
## The CRSP record overview: Industries

``` r
crsp_monthly |> 
  group_by(month, industry) |>  
  summarize(mktcap = sum(mktcap / cpi) / 1000000, .groups = 'drop') |>  
  ggplot(aes(x = month, y = mktcap, color = industry)) + geom_line() +
  labs(x = NULL, y = NULL, title = "Market cap by industry (in trillion USD)")
```

![Market cap by industry (in trillion USD)](data:image/png;base64,#W4U_FD_slides_files/figure-html/unnamed-chunk-20-1.png)

---
## Excess returns in the CRSP sample

- Excess return for a stock is the difference between the stock return and the return on the risk-free security over the same period

``` r
crsp_monthly  |>  
  group_by(month) |> 
  summarise(across(ret_excess, 
                   list(mean = mean, sd = sd, min = min, 
                        q25 = ~ quantile(., 0.25), 
                        median = median, 
                        q75 = ~ quantile(., 0.75), max = max),
                   .names = "{.fn} return")) |> 
  summarise(across(-month, mean)) 
```

---
## Compustat

- Firm accounting data are an important source of information

- Compustat provided by S&P Global Market Intelligence is the main source for this information

- For US and Canadian companies, annual history is available back to 1950 (quarterly frequency also available)

``` r
funda_db <- tbl(wrds, in_schema("comp", "funda"))
```

---
## Compustat download via WRDS

1. We get only records in industrial data format
1. In the standard format (i.e., consolidated information in standard presentation)
1. Only data in the desired time window and the required fields

``` r
compustat <- funda_db |>
  filter(indfmt == "INDL" &
      datafmt == "STD" &
      consol == "C" &
      datadate >= start_date & datadate <= end_date)
```

There is no selection of the required fields here nor a `collect()`

---
## Compustat: Ensure unique entries

``` r
compustat <- compustat |>
  mutate(year = year(datadate)) |>
  group_by(gvkey, year) |>
  filter(datadate == max(datadate)) |>
  ungroup()
```

---
## Linking CRSP and Compustat

- CRSP and Compustat use different keys to identify stocks and firms

- A curated matching table on WRDS allows us to merge CRSP and Compustat

``` r
ccmxpf_linktable_db <- tbl(
  wrds,
  in_schema("crsp", "ccmxpf_linktable")
)
```

---
## Linking CRSP and Compustat: Active links

- Not all links are of equal quality, hence, we filter the link (see #TidyFinance Chapter 3.6)

``` r
ccmxpf_linktable <- ccmxpf_linktable_db |>
  filter(linktype %in% c("LU", "LC") &
    linkprim %in% c("P", "C") &
    usedflag == 1) |>
  select(permno = lpermno, gvkey, linkdt, linkenddt) |>
  collect() |>
  mutate(linkenddt = replace_na(linkenddt, today()))
```

---
## Corporate bond data on WRDS

- Mergent Fixed Income Securities Database (FISD) is the primary resource for bond characteristics

``` r
mergent <- tbl(wrds, in_schema("fisd", "fisd_mergedissue"))
```

- The Trade Reporting and Compliance Engine (TRACE) contains all trade messages for US corporate bonds
  + TRACE is a large database
  + Trade message have to be cleaned to filter cancellations, etc.

---
## TRACE cleaning code

- The cleaning code for TRACE is based on Dick-Nielsen (2009, 2014)

- We wrote a function for your use
  + Use the existing connection to WRDS
  + Download and clean your sample in one step

- For the function, you want the package `{devtools}` and then

``` r
library(devtools)
source_gist("3a05b3ab281563b2e94858451c2eb3a4")
```

---
## TRACE download: Select bonds

- Based on Mergent FISD, you want to select a sample of bonds that are relevant for your project

- Avoid downloading unecessary data, i.e., only download what you need:

``` r
mergent_cusips <- mergent |>
  pull(complete_cusip)

mergent_parts <- split(
  mergent_cusips,
  rep(1:10, 
      length.out = length(mergent_cusips)))
```

---
## TRACE download: Loop

- Equipped with the cleaning function and the bond sample, we can loop through the download process

``` r
trace_enhanced <- clean_enhanced_trace(
    cusips = mergent_parts[[j]],
    connection = wrds,
    start_date = ymd("2014-01-01"),
    end_date = ymd("2016-11-30")
  )
```

---
class: chapter-slide

# Data storage for research

---
## Why to use a database

- We avoided downloading CSV-files and similar separate files

- However, how do we store our data efficiently now?

- We propose setting up a SQLite database
  + We can store all data in one file
  + We can add, change, and delete individual data sets
  + We do not have to load the full data package

---
## RSQLite and dbplyr

- The most efficient way is to use an SQLite database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine

- We can use this powerful tool with `{RSQLite}` and `{dbplyr}`

``` r
library(RSQLite)
library(dbplyr)
```

---
## Setting up a database

- An SQLite database is easily created - the code below is really all there is:

``` r
tidy_finance <- dbConnect(
  SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)
```

- We use the `extended_types=TRUE` option to enable date types when storing and fetching data

---
## Write and load data

- We can write data to our database

``` r
dbWriteTable(tidy_finance,
  "factors_ff_monthly",
  value = factors_ff_monthly,
  overwrite = TRUE
)
```

- And we can load from the database

``` r
factors_ff_monthly_db <- tbl(tidy_finance, "factors_ff_monthly") |> 
  collect()
```

---
class: chapter-slide

# Other data provides

---
## Alternative data sources

- Yahoo!finance provides some interesting data
  + Access is easy with the `tidyquant` package
  + Check out the introductory Chapter 1 on Tidy Finance
  
- Macroeconomic data is available from FRED, via `fredr` or `alfred`

- Bloomberg and Refinitiv data is also available via R packages

- More risk factors from [Global factor data](https://jkpfactors.com/) and [Open source asset pricing](https://www.openassetpricing.com/data)

- Crypto data from [coinmarketcap](https://www.tidy-finance.org/coinmarketcap.com) via `crypto2` package

---
class: chapter-slide

# Final remarks

---
## Conclusion

- Now, you know some important data providers for financial data

- You also know how to access and store the data

- Additionally, we discussed the importance of investigating the data input

- This workshop covers several topics and leaves some details due to time constraints: Check out [Tidy Finance with R](https://www.tidy-finance.org) to find out more.

---
class: middle, left
background-image: url("data:image/png;base64,#images/cover.jpg")
background-position: right
background-size: 400px

### Reach out with comments, questions, or <br> suggestions: <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:blue;overflow:visible;position:relative;"><path d="M64 112c-8.8 0-16 7.2-16 16v22.1L220.5 291.7c20.7 17 50.4 17 71.1 0L464 150.1V128c0-8.8-7.2-16-16-16H64zM48 212.2V384c0 8.8 7.2 16 16 16H448c8.8 0 16-7.2 16-16V212.2L322 328.8c-38.4 31.5-93.7 31.5-132 0L48 212.2zM0 128C0 92.7 28.7 64 64 64H448c35.3 0 64 28.7 64 64V384c0 35.3-28.7 64-64 64H64c-35.3 0-64-28.7-64-64V128z"/></svg> [contact@tidy-finance.org](mailto:contact@tidy-finance.org)

&nbsp;

## <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M464 256A208 208 0 1 1 48 256a208 208 0 1 1 416 0zM0 256a256 256 0 1 0 512 0A256 256 0 1 0 0 256zM294.6 135.1c-4.2-4.5-10.1-7.1-16.3-7.1C266 128 256 138 256 150.3V208H160c-17.7 0-32 14.3-32 32v32c0 17.7 14.3 32 32 32h96v57.7c0 12.3 10 22.3 22.3 22.3c6.2 0 12.1-2.6 16.3-7.1l99.9-107.1c3.5-3.8 5.5-8.7 5.5-13.8s-2-10.1-5.5-13.8L294.6 135.1z"/></svg> visit [tidy-finance.org](https://www.tidy-finance.org)