Financial Modeling Prep (FMP) is a relatively new service that offers a wide range of financial metrics through an API. Use fmpcloudr to access the data and analyze it in R.
Financial Modeling Prep and its sister site FMP Cloud provide access to a wide range of financial data points. This article uses the fmpcloudr package to pull data from the API into R. The analysis looks at historical returns, trade volume, and the composition of the S&P 500. One thing to note: because FMP is still relatively new, some historical data points are unavailable. This will be highlighted within the analysis.
First, set up the R environment:
# Load the necessary libraries
library(fmpcloudr)
library(dplyr)
library(lubridate)
library(plotly)
# Set options and create if.na function
options(scipen=999)
options(dplyr.width = Inf)
# Helper: replace NA values in x with y
if.na <- function(x, y) ifelse(is.na(x), y, x)
# Set the FMP token - stored in a local location on server
fmpKey = readRDS('/Users/tonytrevisan/Documents/Server/Secure/fmp.rds')
fmpcloudr::fmpc_set_token(fmpKey, noBulkWarn = TRUE)
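As a quick sanity check of the helper (a small illustrative example added here; dplyr's coalesce() covers the same use case):
# if.na() swaps NAs for a default value
if.na(c(1, NA, 3), 0)
# [1] 1 0 3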
FMP provides access to index data, which many of the free and inexpensive pricing APIs do not. For many of the Total Return indexes, data only goes back to April of 2020, but the Price Return indexes have a much deeper history. Note: Total Return includes dividend reinvestment.
First we will pull index data for the three major indexes: S&P 500, Nasdaq, and Dow Jones, along with Gold. We can search for these symbols using fmpc_symbols_by_market().
# Search for available symbols for indexes and commodities
AvailIndx = fmpc_symbols_by_market(c('index','commodity'))
# Filter for Gold, S&P, Dow and Nasdaq
AvailIndx %>%
mutate(lowerName = tolower(name)) %>%
filter(grepl('gold',lowerName) | grepl('s&p 500',lowerName) |
grepl('dow jones ind',lowerName) | grepl('nasdaq',lowerName)) %>%
select(1,2) -> symbName
knitr::kable(symbName)
| symbol | name |
|---|---|
| ZGUSD | Gold |
| GCUSD | Gold |
| YMUSD | Mini Dow Jones Indus.-$5 |
| ESUSD | E-Mini S&P 500 |
| NQUSD | Nasdaq 100 |
| MGCUSD | Micro Gold Futures |
| ^SP500TR | S&P 500 (TR) |
| ^XNDX | NASDAQ 100 Total Return Index |
| ^NDX | NASDAQ 100 |
| ^VXN | CBOE NASDAQ 100 Volatility |
| ^IXIC | NASDAQ Composite |
| ^DJITR | Dow Jones Industrial Average TR |
| ^GSPC | S&P 500 |
| ^GVZ | CBOE Gold Volatitity Index |
| ^SP500-30 | S&P 500 Consumer Staples (Sector) |
| ^SP500-40 | S&P 500 Financials (Sector) |
| ^SP500-45 | S&P 500 Information Technology (Sector) |
| ^SP500-25 | S&P 500 Consumer Discretionary (Sector) |
| ^DJI | Dow Jones Industrial Average |
| ^XSP | S&P 500 MINI SPX OPTIONS INDEX |
Once we have the symbols, we can use fmpc_price_history() to pull the price history for each index. We will pull the Price Return indexes due to their longer history. Luckily, the S&P 500 Total Return index is also available.
# Pull data for price return indexes, Gold, and the S&P 500 TR back to 2000
# '^DJI' is the Dow Jones Industrial Average price return index
IndxHist = fmpc_price_history(c('^NDX','^DJI','^GSPC','^SP500TR','GCUSD'),
                              startDate = '2000-01-01')
# Calculate the cumulative return of each symbol
IndxHist %>%
  # Bring in naming data
  left_join(select(AvailIndx, symbol, name), by = 'symbol') %>%
  group_by(symbol) %>%
  arrange(date, .by_group = TRUE) %>%     # Ensure prices are in date order before lagging
  # Calculate log returns and the cumulative return for each symbol
  mutate(return = if.na(log(adjClose/lag(adjClose)), 0),  # Daily log return for each index
         name = case_when(symbol == 'GCUSD' ~ 'Gold',     # Friendlier plot labels
                          symbol == '^DJI'  ~ 'Dow Jones (PR)',
                          symbol == '^NDX'  ~ 'Nasdaq 100 (PR)',
                          TRUE ~ name),
         cumReturn = exp(cumsum(return)) - 1) -> IndxReturn
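A quick aside on why the calculation uses log returns: they are additive across periods, so the cumulative return falls out of a simple cumsum(). A minimal illustration with made-up prices (not FMP data):
# Log returns sum, so exp(cumsum(...)) - 1 recovers the total return
p = c(100, 110, 99, 105)
r = log(p/lag(p))              # dplyr::lag; the first element is NA
exp(sum(r, na.rm = TRUE)) - 1  # 0.05, the same as 105/100 - 1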
Finally, we can plot the data to see how the cumulative return looks over time.
# use plotly to create interactive plot
plot_ly(IndxReturn) %>%
add_trace(x = ~date, y = ~cumReturn, color = ~name, type = 'scatter', mode = 'lines') %>%
layout(title = 'Major Index Performance since Jan 2000',
hovermode = 'compare',
legend = list(orientation = "h", xanchor = "center", x = 0.5),
xaxis = list(title = ''),
yaxis = list(title = 'Return', tickformat = '.0%'))
As can be seen above, gold has performed very well since 2000 despite a massive bear market between 2011 and 2015. The other averages have all moved in line with each other. The Nasdaq looks to be underperforming over this period because of the huge drawdown during the tech crash in the early 2000s.
As mentioned above and in a previous article, the price return and total return differ due to the reinvestment of dividends. While dividend yields typically range between 1% and 2%, the compounding effect can be quite dramatic.
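To see how dramatic the compounding can be, here is a quick back-of-the-envelope sketch with hypothetical numbers: a 7% annual price return with and without a 2% reinvested dividend yield. These figures are illustrative, not pulled from FMP.
# Hypothetical 22-year horizon, roughly the window analyzed here
yrs = 22
priceReturn = (1 + 0.07)^yrs - 1         # ~343% cumulative
totalReturn = (1 + 0.07 + 0.02)^yrs - 1  # ~566% cumulative
round(c(PR = priceReturn, TR = totalReturn), 2)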
In this analysis we will compare the price return indexes to the total return performance of the ETFs. As a point of comparison, I have also included the S&P 500 total return index to see how SPY performs against the index it tracks. The SPY return will be slightly below the index due to ETF fees.
# Pull the ETF data for S&P and Nasdaq - QQQ and SPY
ETFHist = fmpc_price_history(c('QQQ','SPY'), startDate = '2000-01-01')
# Calculate the returns
ETFHist %>%
  group_by(symbol) %>%
  arrange(date, .by_group = TRUE) %>%  # Ensure date order before lagging
  mutate(return = if.na(log(adjClose/lag(adjClose)), 0),
         cumReturn = exp(cumsum(return)) - 1) %>%
  filter(date == max(date)) -> ETFReturn  # Keep only the final cumulative return
# Pull the Index data from the analysis above
NasSP = IndxReturn %>%
filter(symbol %in% c('^GSPC','^NDX','^SP500TR')) %>%
filter(date == max(date))
# Create a name map
map = tibble(symbol = c('^GSPC','^NDX','^SP500TR','QQQ','SPY'),
plotname = c('SP 500 PR','Nasdaq PR','SP 500 TR','Nasdaq ETF TR','SP 500 ETF TR'))
# Stack the data and keep the max date, which holds the full-period cumulative return
bind_rows(NasSP, ETFReturn) %>%  # bind_rows fills the name column missing from ETFReturn
  filter(date == max(date)) %>%
  left_join(map, by = 'symbol') -> PlotTRPR
Once the returns have been calculated, we can plot the data using a bar chart to show the cumulative return over the time period. The tech-heavy QQQ is not known for high dividends, and this can be seen in the chart below. The TR-versus-PR gap is much greater for the S&P 500 than for the Nasdaq (a 108% difference versus only 34%).
# Plot the ETF Data in a bar chart
plot_ly(PlotTRPR) %>%
add_trace(x = ~plotname, y = ~cumReturn, color = ~plotname, type = 'bar') %>%
layout(title = 'Price Return vs Total Return',
hovermode = 'compare',
showlegend = FALSE,
xaxis = list(title = ''),
yaxis = list(title = 'Return', tickformat = '.0%'))
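The TR-versus-PR gaps quoted above can also be computed directly from PlotTRPR instead of being read off the chart. A sketch (exact values depend on the end date of the pull; assumes tidyr is available):
# Spread the cumulative returns wide and difference TR vs PR
library(tidyr)
PlotTRPR %>%
  ungroup() %>%
  select(plotname, cumReturn) %>%
  pivot_wider(names_from = plotname, values_from = cumReturn) %>%
  mutate(spGap = `SP 500 TR` - `SP 500 PR`,          # ~108% per the text above
         nasdaqGap = `Nasdaq ETF TR` - `Nasdaq PR`)  # ~34% per the text above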
ETFs have become extremely popular over the past twenty years, and we can use trading in S&P 500 ETFs as a proxy for the increase in activity. Daily trade volume, along with total assets, is a great indicator. There are three main ETFs that track the S&P 500: SPY, IVV, and VOO. We can analyze trade volume across all three to see how it has trended over time. Below we pull the data and plot it as a stacked bar chart.
# Pull the three tickers back to SPY inception
SPETF = fmpc_price_history(c('IVV','VOO','SPY'), startDate = '1993-01-01')
# Calculate notional volume, and take the daily average for each quarter
SPETF %>%
  mutate(notionVol = adjClose * unadjustedVolume) %>%            # Notional amount is price * share volume
  mutate(qtrEnd = ceiling_date(date, unit = 'quarter') - 1) %>%  # Calculate quarter-end date
  group_by(qtrEnd, symbol) %>%                                   # Group by qtr/symbol to calculate the average
  summarise(avgDlyvol = mean(notionVol), .groups = 'drop') %>%
  arrange(qtrEnd) -> ETFVol
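The quarter-end trick above deserves a quick note: ceiling_date() rolls a date up to the start of the next quarter, so subtracting one day lands exactly on the quarter-end.
# ceiling_date() rounds up to the next quarter start; minus one day = quarter end
ceiling_date(as.Date('2020-02-14'), unit = 'quarter') - 1
# [1] "2020-03-31"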
# Plot the data in a stacked bar chart
plot_ly(ETFVol) %>%
add_trace(x = ~qtrEnd, y = ~avgDlyvol, color = ~symbol, type = 'bar') %>%
layout(title = 'Average Daily Volume',
hovermode = 'compare',
barmode = 'stack',
legend = list(orientation = "h", xanchor = "center", x = 0.5),
xaxis = list(title = ''),
yaxis = list(title = 'Notional Daily Volume'))
As can be seen above, trade volume in S&P 500 ETFs has increased dramatically since the inception of the first ETF (SPY) back in 1993. One glaring takeaway is how much SPY dominates the trade volume. Interestingly, this is not due entirely to size. In the pull below, we can see that even though SPY has the largest market cap (almost 50% of the total), its trade volume dominates, garnering over 90% of the total. SPY has always been the most popular ETF and is used by many institutions due to its liquidity.
# Profiles show the current market cap and average volume
knitr::kable(fmpc_security_profile(c('SPY','IVV','VOO')) %>%
mutate(mktCap = format(mktCap, big.mark = ','),
volAvg = format(volAvg, big.mark = ',')) %>%
select(symbol, volAvg, mktCap, companyName), "simple")
| symbol | volAvg | mktCap | companyName |
|---|---|---|---|
| SPY | 79,869,183 | 407,302,477,774 | SPDR S&P 500 ETF Trust |
| IVV | 3,757,808 | 277,428,862,586 | iShares Core S&P 500 ETF |
| VOO | 3,576,846 | 0 | Vanguard 500 Index Fund |
The S&P 500 index was formed in 1957 and has changed dramatically over that time. Many companies have been added and removed based on certain selection criteria. Recent articles have highlighted how the concentration in the top 5 companies has never been greater. This analysis will attempt to replicate those findings using FMP.
Reconstructing the S&P 500 historically is a bit tricky, but FMP offers API endpoints that allow you to reconstruct the index historically and chart the concentration.
As with any data exercise, the data is often not perfect. Unfortunately that is the case here, so the data must be cleaned and modified. The FMP support staff is extremely responsive and very helpful, and they have been great at correcting any data issues I bring to their attention. These issues are a bit low on the priority list, so I have not raised them, but I also think the data is good enough for exploration. Additionally, it could be challenging to pull historical market cap data for delisted companies.
The goal of this section is to analyze the top 5 companies in the S&P 500 by doing the following:

- Reconstruct the year-end constituents of the S&P 500 using the historical additions and removals
- Pull historical market cap data for each constituent
- Use plotly to create an overlay of the S&P 500 concentration over time

# Pull the current list of the S&P 500
spcur = fmpc_symbols_index(period = 'current')
# Pull the historical list of companies that have been added and removed
sphist = fmpc_symbols_index(period = 'historical', index = 'sp500') %>%
mutate(date = as.Date(date))
The data goes back to 1994, but as mentioned above, it is not perfectly clean and consistent. For example, Morgan Stanley appears in the current S&P 500 list, but the historical list shows it as removed back in 1997 without ever being added again.
| symbol | name | sector | subSector | headQuarter | dateFirstAdded | cik | founded |
|---|---|---|---|---|---|---|---|
| MS | Morgan Stanley | Financials | Investment Banking & Brokerage | New York City, New York | 1993-07-29 | 0000895421 | 1935 |
| dateAdded | addedSecurity | removedTicker | removedSecurity | date | symbol | reason |
|---|---|---|---|---|---|---|

(no rows returned)
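The two views above can be reproduced with a couple of filters on the spcur and sphist pulls from earlier (a sketch):
# Morgan Stanley appears in the current constituent list...
spcur %>% filter(symbol == 'MS')
# ...the historical feed records its removal...
sphist %>% filter(symbol == 'MS', removedSecurity != '')
# ...but never a subsequent add
sphist %>% filter(symbol == 'MS', addedSecurity != '')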
I don’t want to get too caught up with all the data issues, so instead I will make notes about them in the comments.
To simplify the output, I only want a snapshot of the S&P 500 at the end of each year. The data set starts in 1994, so I will take the S&P 500 as of 1994 and loop through each year up to the current year, adding and removing tickers as indicated by the data set. Because of the Morgan Stanley issue highlighted above, I will also force-add any ticker currently in the S&P 500 that has an add date on or before the year in the loop.
Note: I recognize that loops are frowned upon in R and that it is usually better to use a function within map or apply, but I thought a loop would be easier to read.
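For reference, here is a rough sketch of what a functional version of the rollforward below could look like using purrr::accumulate(). This is an illustrative alternative, not the code the analysis runs; it carries plain ticker vectors rather than the dated tibbles the loop builds.
library(purrr)
# One step of the rollforward: apply year i's adds/removes to the previous set
roll_fwd = function(prev, i) {
  rmvd  = sphist %>% filter(year(date) == i, removedSecurity != '') %>% pull(symbol)
  added = sphist %>% filter(year(date) == i, addedSecurity != '') %>% pull(symbol)
  incur = spcur %>% filter(substr(dateFirstAdded, 1, 4) <= i) %>% pull(symbol)
  setdiff(unique(c(prev, added, incur)), rmvd)
}
# accumulate() keeps every intermediate year, much like sp500yr below
snapshots = accumulate(1995:year(Sys.Date() - 20), roll_fwd,
                       .init = sphist %>% filter(year(date) == 1994) %>% pull(symbol))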
# Start with the constituents from the first year of data, 1994
sp500yr = sphist %>%
filter(year(date)==1994) %>%
mutate(date = as.Date('1994-12-31')) %>%
select(symbol,date)
# Make this the "previous year" to start the loop
prevYear = sp500yr
# Create a vector of years to loop through (the 20-day offset keeps an
# early-January run from adding a year with no changes recorded yet)
yrloop = 1995:year(Sys.Date()-20)
# Loop through each year from 1995 to the current year
for (i in yrloop) {
# Identify securities that have been removed
rmvd = sphist %>%
filter(year(date)==i, removedSecurity != '') %>%
pull(symbol)
# Identify securities that have been added
added = sphist %>%
filter(year(date)==i, addedSecurity != '') %>%
select(symbol) %>%
mutate(date = as.Date(paste0(i,'-12-31')))
# Use the previous year tickers as a starting point for this year and update the date
Curyear = prevYear %>%
mutate(date = as.Date(paste0(i,'-12-31')))
# For current S&P 500 funds include any tickers that were added on or before the loop year
# This is to correct for the Morgan Stanley issue noted above
spcur %>% filter(substr(dateFirstAdded,1,4)<=i) %>%
mutate(date = as.Date(paste0(i,'-12-31'))) %>%
select(symbol,date) -> currentCheck
# Bind them together, filter out removed tickers, and make unique
Curyear = rbind(Curyear,added,currentCheck) %>%
filter(!(symbol %in% rmvd)) %>%
distinct()
# The current year becomes the new previous year
prevYear = Curyear
# Append each year to the next in the loop
sp500yr = rbind(sp500yr,Curyear)
}
Once the loop is complete, we can look at the data to see how it turned out:
table(sp500yr$date)
1995-12-31 1996-12-31 1997-12-31 1998-12-31 1999-12-31 2000-12-31
160 162 177 189 199 213
2001-12-31 2002-12-31 2003-12-31 2004-12-31 2005-12-31 2006-12-31
222 235 241 247 253 264
2007-12-31 2008-12-31 2009-12-31 2010-12-31 2011-12-31 2012-12-31
283 307 325 341 355 372
2013-12-31 2014-12-31 2015-12-31 2016-12-31 2017-12-31 2018-12-31
390 404 426 443 465 480
2019-12-31 2020-12-31 2021-12-31 2022-12-31 2023-12-31
493 502 511 516 518
As can be seen, the data is not perfect, but it is close enough. There are 476 tickers starting in 1994, building up toward 500. There are currently 505 tickers in the S&P 500, but you can also see that some years had more than this. The data will be cleaned up more in later steps.
FMP also provides an endpoint to collect historical market cap data. Unfortunately, the history here is not great: for any company that was not actively traded after January 2020, the history is unavailable, which includes many removed names.
This API request will fetch a large amount of data for over 1,000 securities, and the package puts time spacers between requests to ensure the API calls do not exceed the FMP rate limit. The total run time will be about 3-5 minutes.
Note: This is an example of using lapply instead of a for loop.
# Get a list of unique symbols
unqsym = unique(sp500yr$symbol)
# Takes roughly 5 minutes to run
# Use bind_rows to collapse each symbol's pull into a single tibble
mktcapSP = bind_rows(lapply(unqsym, function(x) {
# Pull the market cap for at least 7,000 trading days (~1993)
tick = fmpc_security_mrktcap(x, limit = 7000)
if (is.null(tick)) return(NULL)  # Skip symbols with no market cap history
# Modify the pull to shrink and clean the data
tick %>%
mutate(year = year(date)) %>%
group_by(year,symbol) %>%
filter(date <= '2020-09-30') %>% # Ignore the most recent data due to issues in mkt cap
filter(date == max(date)) %>% # Take last day available for each year
ungroup
}))
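Before aggregating, a quick check (an addition for illustration, not in the original flow) of how many constituents returned no market cap history at all:
# Symbols from the reconstructed index with no market cap data returned
missing = setdiff(unqsym, unique(mktcapSP$symbol))
length(missing)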
We can aggregate the data by splitting each year into the top 5 companies and everything else. Calculating the top 5's percentage of total market cap then becomes relatively easy.
# Create market cap percentage for each year where data is available
sp500yr %>%
mutate(year = year(date)) %>%
left_join(select(mktcapSP,-date), by = c('symbol','year')) %>% # join in market cap data
mutate(marketCap = ifelse(is.na(marketCap),0,marketCap)) %>% # Set NA's to 0
group_by(year) %>%
arrange(desc(marketCap)) %>%
mutate(grp = ifelse(row_number()>5,'Outside top 5','Top 5')) %>%
group_by(year,grp) %>%
summarise(marketCap = sum(marketCap)) %>%
mutate(percentage = marketCap/sum(marketCap)) ->
mktcapYear
top5Perc = filter(mktcapYear, grp == 'Top 5')
plot_ly(mktcapYear) %>%
add_trace(x = ~year, y = ~marketCap, name = ~grp, color = ~grp, type = 'bar') %>%
add_trace(data = top5Perc, x = ~year, y = ~percentage, name = 'Top 5 companies market cap %', yaxis = "y2",
          type = 'scatter', mode = 'lines', line = list(color = 'black')) %>%
layout(title = 'Market Cap and Percentage of top 5 companies',
barmode = 'group',
hovermode = 'compare',
xaxis = list(title=''),
margin = list(l = 75, r = 75, b = 50, t = 50, pad = 4),
yaxis2 = list(side = 'right', overlaying = 'y' , title='Top 5 Market Cap %',zeroline = F,showgrid = F,tickformat = '.1%'),
yaxis = list(side = 'left',title='Market Cap',zeroline = F,showgrid = T),
legend = list(traceorder='reversed',orientation = "h"))
Based on the chart above, the peak was hit back in 2006-2009, even though concentration is rising rapidly again. I suspect this is due to market cap data going increasingly missing the further back we go. Let's dig in below.
The data above does not align with the articles that have recently been published about the S&P 500 reaching a historic concentration in the top 5 securities. To better understand what is going on, it will be beneficial to use an animation plot.
In order to plot the data correctly in an animation, there must be an equal number of points in each frame of the time lapse. Therefore, if there are 400 securities with data in one year and 500 in another, every year must be trimmed to the smallest count available across all years.
# Check for valid market caps by year
sp500yr %>%
mutate(year = year(date)) %>%
left_join(select(mktcapSP,-date), by = c('symbol','year')) %>% # join in market cap data
mutate(marketCap = ifelse(is.na(marketCap),0,marketCap)) %>% # Set NA's to 0
filter(marketCap != 0) -> # Filter out 0's
valid_mkt_cap
table(valid_mkt_cap$year)
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
156 158 172 185 193 203 211 224 229 231 239 250 267 290
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
306 318 329 343 361 372 397 423 449 466 481 490
As can be seen in the table above, our lowest common denominator is 203. It is no surprise that this falls in the earliest years of the data, because the data gets more scarce the farther back you go. It essentially implies that roughly 60% of the 1994 S&P 500 no longer trades!
In a perfect world we would have all 500 tickers for each year, but that is not the case. That being said, the most likely reasons for removal from the S&P 500 are low market cap, acquisition, or ceasing to trade. Generally speaking, this affects smaller companies more frequently, so it hopefully should not invalidate our findings. We can confirm with a second plot.
In the next step we will reduce our population to the 200 largest companies by market cap for each year in the analysis. We will then bucket these into 40 groups based on market cap, so that each group will have exactly 5 tickers.
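The bucketing relies on dplyr's ntile(), which splits a vector into equal-sized groups. A tiny illustration:
# ntile() assigns equal-sized bucket numbers: 10 values into 5 groups of 2
ntile(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5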
With that in mind, we can run the manipulation and then plot the data:
valid_mkt_cap %>%
group_by(year) %>% # Group the data for each year
arrange(desc(marketCap)) %>% # Arrange in descending order
filter(row_number()<=200) %>% # Take the top 200 funds by year
mutate(ngroup = paste0('grp:',ntile(marketCap,40))) %>% # Create 40 buckets in each year
group_by(year,ngroup) %>% # Group by year and bucket
summarise(marketCap = sum(marketCap)) %>% # Sum the market cap of each bucket
group_by(year) %>% # Group by year
mutate(mktcapPer = marketCap/sum(marketCap)) %>% # Calc the % of each bucket by year
ungroup() -> plotReady
Once the data has been set so that each year has 40 buckets sorted by market cap, we can plot the data into a plotly animation chart.
plot_ly(
plotReady,
x = ~marketCap, y = ~mktcapPer,
size = ~marketCap, name = 'Mrkt Cap Bucket',
frame = ~year, hoverinfo = "text",
text = ~paste0(ngroup,'\nMkt Cap ($M):',format(round(marketCap/1e6,0),big.mark=","),
'\nMkt Cap %:',round(mktcapPer*100,2),'%'),
type = 'scatter', mode = 'markers'
) %>%
layout(
title = 'Concentration of the S&P 500 over time',
xaxis = list(title = 'Market Cap'),
yaxis = list(title = 'Market Cap % of total', tickformat = '.1%')
)
The full playback looks good. Based on this chart, we are not at peak concentration: the top bucket (the 5 largest companies) crossed over 35% in 2008 and currently sits below 30%. That being said, we are only looking at 200 of the 500 companies, which could be very misleading.
To confirm our findings, let's move the start of our time lapse up to 2006. By doing so, we can keep the top 340 companies for each year instead of 200 (as shown above), which means 68 groups rather than 40. First we run the manipulation, followed by the plot.
valid_mkt_cap %>%
filter(year>=2006) %>%
group_by(year) %>% # Group the data for each year
arrange(desc(marketCap)) %>% # Arrange in descending order
filter(row_number()<=340) %>% # Take the top 340 funds by year
mutate(ngroup = paste0('grp:',ntile(marketCap,68))) %>% # Create 68 buckets in each year
group_by(year,ngroup) %>% # Group by year and bucket
summarise(marketCap = sum(marketCap)) %>% # Sum the market cap of each bucket
group_by(year) %>% # Group by year
mutate(mktcapPer = marketCap/sum(marketCap)) %>% # Calc the % of each bucket by year
ungroup() -> plotReady
Plotting again:
plot_ly(
plotReady,
x = ~marketCap, y = ~mktcapPer,
size = ~marketCap, name = 'Mrkt Cap Bucket',
frame = ~year, hoverinfo = "text",
text = ~paste0(ngroup,'\nMkt Cap ($M):',format(round(marketCap/1e6,0),big.mark=","),
'\nMkt Cap %:',round(mktcapPer*100,2),'%'),
type = 'scatter', mode = 'markers'
) %>%
layout(
title = 'Concentration of the S&P 500 over time since 2006',
xaxis = list(title = 'Market Cap'),
yaxis = list(title = 'Market Cap % of total', tickformat = '.1%')
)
Even after adding an extra 140 companies, we are still showing 2007 as having higher concentration than today. Unfortunately, it looks like our analysis is not correct; there are many articles showing exactly the opposite of our findings here.
What could be causing the discrepancy? As mentioned several times, incomplete data. Although the bottom 200 or 300 companies will not be as large as the upper end, they could carry enough market cap to skew the weightings. I do not blame FMP for this; they only started collecting market cap data as of 2020. To get a complete picture, we would most likely have to get data from a paid source.
In this article, we used R and several packages to analyze financial stock data. We looked at price history, trade volume, and S&P 500 concentration. This merely scratches the surface of the data available from FMP. I highly recommend digging into the fmpcloudr package to see what else is available.
Disclosure: The content herein is my own opinion and should not be considered financial advice or recommendations.
For attribution, please cite this work as
Trevisan (2022, Nov. 1). ALT Analytics: Analyzing the S&P 500 with FMP. Retrieved from https://www.altanalyticsllc.com/posts/2022-11-01-analyzing-sp500-with-fmp/
BibTeX citation

@misc{trevisan2022analyzing,
  author = {Trevisan, Tony},
  title = {ALT Analytics: Analyzing the S&P 500 with FMP},
  url = {https://www.altanalyticsllc.com/posts/2022-11-01-analyzing-sp500-with-fmp/},
  year = {2022}
}