Using a random forest classifier to predict Bitcoin price fluctuations from the activity of the richest Bitcoin addresses
There is no shortage of recent articles and hype surrounding cryptocurrencies. Currencies like Bitcoin, Dogecoin, and others provide celebrities and seasoned and amateur investors alike with an opportunity to amass life-changing amounts of wealth. Tesla has even turned its $1.5B investment in Bitcoin into $1B in profit!
Those who buy and sell cryptocurrencies like Bitcoin are assigned addresses to track their transactions on the blockchain. The distribution of balances across these addresses is highly uneven, so much so that a subset of them are dubbed “whales” to reflect how their actions have the power to manipulate the currency’s valuation. As an example, Elon Musk made comments about Dogecoin during his appearance on SNL that drove prices down, then tweeted about it and prices jumped back up.
This causes me to think: what if investors do not actually need to bet on what the next hot currency will be in order to make a significant profit? Instead, they might be able to predict when whales will sell off coins and short the currency.
In addition to news and current events, I suspect that whales buy and sell cryptocurrency largely in response to price fluctuations. Note that the analysis that follows does not attempt to explain what causes whales to buy and sell cryptocurrency (certainly there are more factors than price alone), but rather to predict whether whales will sell Bitcoin within the next 24 hours.
To answer this question, I use historical Bitcoin (BTC) price data from Gemini Exchange and transaction history for the richest 100 BTC addresses (i.e., the whales). I limit my analysis to the period between October 7, 2015 and April 20, 2021 since, at the time I built my model, this reached as far back into the past as possible while keeping a consistent time frame between the two datasets.
I start by pulling in the historical BTC price data for each year, by minute, and concatenating it into a single dataframe. Additionally, I perform some preliminary cleansing steps (i.e., dropping unnecessary columns, converting dates to datetime format in Pandas, and setting date as my index).
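The loading step above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names and the two in-memory CSV snippets are assumptions standing in for the per-year minute-level files, and `load_prices` is a hypothetical helper name.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the per-year minute-level CSV files.
# The column layout here is an assumption made for illustration.
csv_2015 = io.StringIO(
    "Date,Symbol,Open,High,Low,Close,Volume\n"
    "2015-10-08 13:40:00,BTCUSD,242.5,243.0,242.0,242.9,1.5\n"
    "2015-10-08 13:41:00,BTCUSD,242.9,243.5,242.8,243.2,0.7\n"
)
csv_2016 = io.StringIO(
    "Date,Symbol,Open,High,Low,Close,Volume\n"
    "2016-01-01 00:00:00,BTCUSD,430.0,431.0,429.5,430.8,2.1\n"
)

def load_prices(sources):
    """Read each yearly file, stack them, and index the result by timestamp."""
    frames = [pd.read_csv(src) for src in sources]
    prices = pd.concat(frames, ignore_index=True)
    prices = prices.drop(columns=["Symbol"])         # drop a column not needed downstream
    prices["Date"] = pd.to_datetime(prices["Date"])  # strings -> datetime64
    return prices.set_index("Date").sort_index()

prices = load_prices([csv_2015, csv_2016])
print(prices)
```

With real files, the `io.StringIO` objects would simply be replaced by file paths.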
I decide upfront to roll all data up to the day level instead of attempting to answer the question at the minute or second level. Therefore, I downsample the minute data to days and check for NaN values, knowing I will need to return later to impute them. The historical BTC price data does not have many features, so I decide to engineer my own.
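As a rough sketch of the downsampling step, the snippet below resamples synthetic minute-level data to daily bars and counts missing values. The aggregation rules (first open, max high, min low, last close, summed volume) are my assumption of a sensible choice; the article does not state which were used.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level prices covering three full days (illustrative only).
idx = pd.date_range("2021-01-01", periods=3 * 24 * 60, freq="min")
rng = np.random.default_rng(0)
close = 30000 + rng.normal(0, 5, len(idx)).cumsum()
minute = pd.DataFrame(
    {"Open": close, "High": close + 1, "Low": close - 1, "Close": close, "Volume": 1.0},
    index=idx,
)

# Downsample to one row per calendar day (assumed aggregation per column).
daily = minute.resample("D").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)

# Count missing values per column; days with no trades would show up here.
nan_counts = daily.isna().sum()
print(daily)
print(nan_counts)
```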
Since I suspect that fluctuations in BTC price may help determine when a whale buys or sells, I create rolling geometric mean and percent change features for several time periods for BTC price open, high, low, close, and volume. This gives me 130 features total.
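One way to build such features is sketched below. The window lengths (3 and 7 days) and the helper name `add_features` are hypothetical, since the article does not list the periods used. The geometric mean is computed as the exponential of the rolling mean of logs, which assumes strictly positive values (a zero-volume day would need special handling).

```python
import numpy as np
import pandas as pd

def add_features(daily, windows=(3, 7)):
    """Append rolling geometric mean and percent change for every column and window."""
    out = daily.copy()
    for col in daily.columns:
        for w in windows:
            # Geometric mean over w days: exp(mean(log(x))). Requires x > 0.
            out[f"{col}_gmean_{w}d"] = np.exp(np.log(daily[col]).rolling(w).mean())
            # Percent change relative to the value w days earlier.
            out[f"{col}_pct_{w}d"] = daily[col].pct_change(periods=w)
    return out

# Toy daily data: every column doubles each day, so results are easy to check by hand.
idx = pd.date_range("2021-01-01", periods=10, freq="D")
base = 2.0 ** np.arange(10)
daily = pd.DataFrame({c: base for c in ["Open", "High", "Low", "Close", "Volume"]}, index=idx)

features = add_features(daily)
print(features.filter(like="Close").head())
```

The first `w - 1` rows of each rolling feature (and the first `w` rows of each percent change) are NaN by construction, which is one reason imputation is needed later.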
Next I need to pull in whale transaction history. I could not find an API for this data, so I copy and paste it into a Google Sheet and perform similar preliminary cleansing steps as with the historical BTC price dataset.
The “Amount” column looks like trouble. All values within “Amount” that are floats are negative (I know this because I analyzed the data as I pulled it offline). So, I reformat the values in that column into an appropriate format for a machine learning task (e.g., removing spaces and converting to float).
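A cleanup of this kind might look like the sketch below. The raw string formats shown (sign separated by a space, thousands separators, a trailing "BTC") are hypothetical examples, not the actual contents of the sheet, and `clean_amount` is an illustrative helper name.

```python
import pandas as pd

def clean_amount(raw):
    """Strip whitespace, thousands separators, and a unit suffix, then convert to float."""
    cleaned = (
        raw.astype(str)
        .str.replace(r"[\s,]", "", regex=True)   # remove spaces and commas
        .str.replace("BTC", "", regex=False)     # remove an assumed unit suffix
    )
    # Anything that still cannot be parsed becomes NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical raw values as they might appear after copy-pasting.
raw = pd.Series(["+ 120.5 BTC", "- 3,000 BTC", "-45.25 BTC", "n/a"])
amount = clean_amount(raw)
print(amount)
```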
Again, I drop unnecessary columns, downsample to day-level data, and restrict to just the time period under investigation. Note that downsampling to day-level data creates what will become my target: the daily sum of all negative values within the “Amount” column. Finally, I confirm that the length of the whale transaction dataframe matches that of the historical BTC price one and merge the two into a single dataframe.
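The aggregation and merge described here can be sketched as follows, using toy data. The column name `whale_sold` and the sample values are assumptions for illustration; only the date range comes from the article.

```python
import pandas as pd

# Toy whale transactions: negative amounts are outflows (sales), positive are inflows.
tx = pd.DataFrame(
    {"Amount": [-10.0, 5.0, -2.5, 7.0]},
    index=pd.to_datetime(
        ["2021-04-18 01:00", "2021-04-18 09:30", "2021-04-19 12:00", "2021-04-20 08:00"]
    ),
)

# Keep only the negative part of each amount, then sum per calendar day.
daily_sold = tx["Amount"].clip(upper=0).resample("D").sum().rename("whale_sold")

# Toy daily price frame over the same days.
prices = pd.DataFrame(
    {"Close": [56000.0, 55500.0, 56400.0]},
    index=pd.date_range("2021-04-18", periods=3, freq="D"),
)

# Restrict both frames to the study period, check lengths, then merge on the date index.
start, end = "2015-10-07", "2021-04-20"
daily_sold = daily_sold.loc[start:end]
prices = prices.loc[start:end]
assert len(daily_sold) == len(prices)

merged = prices.join(daily_sold, how="inner")
print(merged)
```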
I reformat the merged data into something appropriate for a binary classification task. Since I am trying to predict if the next day’s amount will be negative, I convert my target to an integer and shift it up by one day, so that each row is labeled with the following day’s outcome. Additionally, I impute NaN values with the next valid value in each column and replace infinite values (due to division by 0 in the engineered features) with the column mean. Note that I spare the reader from additional detail regarding how I determined where the infinite values are located.
Finally, I check the count for my positive class. Unsurprisingly, the positive class is the minority. Note that I compared the model’s overall AUC with and without SMOTE (which oversamples the positive class) and found no significant difference.
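A minimal sketch of this preparation step is shown below. The column names (`feat`, `whale_sold`, `target`) and the helper `prepare` are hypothetical, and computing the mean over finite values only is my assumption about how the infinite values were replaced.

```python
import numpy as np
import pandas as pd

def prepare(df):
    """Impute features, then label each row with whether whales sell on the NEXT day."""
    out = df.copy()
    feature_cols = [c for c in out.columns if c != "whale_sold"]

    # 1) Fill NaN with the next valid value in each column (backfill).
    out[feature_cols] = out[feature_cols].bfill()

    # 2) Replace +/-inf with the column mean, computed over finite values only.
    finite = out[feature_cols].replace([np.inf, -np.inf], np.nan)
    out[feature_cols] = finite.fillna(finite.mean())

    # 3) Binary flag "sold today", shifted up one row so it describes tomorrow.
    out["target"] = (out["whale_sold"] < 0).astype(int).shift(-1)

    # The last row has no "tomorrow", so it is dropped.
    out = out.dropna(subset=["target"])
    return out.assign(target=lambda d: d["target"].astype(int))

df = pd.DataFrame(
    {
        "feat": [np.nan, 2.0, np.inf, 4.0, 6.0],
        "whale_sold": [0.0, -3.0, 0.0, -1.0, 0.0],
    },
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)
prepared = prepare(df)
print(prepared)
```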
Since random forest classifiers do not require feature scaling and are relatively resistant to overfitting, I decide this is a good place to start. I split my data into train and test sets, then instantiate and fit the model.
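The split-and-fit step might be sketched like this, using synthetic data in place of the merged dataframe. The 80/20 split, the chronological (unshuffled) split, and the hyperparameter values are assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and binary target (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 6))
# The positive class is the minority, loosely mirroring the imbalance noted above.
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

# shuffle=False keeps rows in time order, so the test set is the most recent 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class for each held-out row.
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])
```

Keeping the split chronological avoids training on days that come after the ones being predicted, which matters for time-ordered data.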
I evaluate the model’s performance using the area under the receiver operating characteristic curve (ROC AUC). The AUC for the initial fit is 0.73. I also use RandomizedSearchCV to tune the model’s hyperparameters, but note no improvements in the model’s performance.
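For illustration, the evaluation and tuning steps could look like the sketch below, again on synthetic data. The search space, `n_iter`, and `cv` values are hypothetical; the article does not report which hyperparameters were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Baseline model, scored with ROC AUC on predicted probabilities (not hard labels).
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"baseline ROC AUC: {auc:.3f}")

# Randomized search over a small, assumed hyperparameter space.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5],
    },
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, f"cv ROC AUC: {search.best_score_:.3f}")
```

The AUC printed here reflects the synthetic data only and says nothing about the 0.73 reported for the real model.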
My model indicates that investors may not actually need to bet on what the next hot cryptocurrency will be in order to make a small fortune. Instead, investors can place bets on whales moving the valuation of BTC downward to drive their own profitability.
Further analysis of what else drives a whale to sell BTC (e.g., current events), additional classifiers (XGBoost, etc.), and repeating the experiment at different frequencies (minute, hour, week, etc.) would be interesting follow-up studies and may predict more accurately when a whale will sell BTC. For now, though, this will do.
The Jupyter notebook of code used to arrive at these conclusions can be found on GitHub here.
* I hold no responsibility for any investments made using this model.