04 April 2020

My Take on Beating the Market with Machine Learning Models - an Elephant in the Room

 If you google things like "machine learning model predicting stock price", you will find tons of articles explaining how this can be done. The reality is that we do not see individual data scientists getting rich from such models, which should set off the alarm that this is too good to be true.

 The intention here is not to alienate myself by labelling these authors as scammers, but to discuss how machine learning works, so that hopefully everyone can learn something or raise more interesting questions, which without doubt benefits the whole community. After all, those authors' work on analysis, visualisation, and modelling is well appreciated.

 First off, the strategies. Bear with me and you will see this is actually relevant. There are thousands of them out there, if not millions, and new ones keep popping up every day. A simple example would be to buy when RSI (the Relative Strength Index) indicates oversold and sell when it indicates overbought. The invention of RSI was arbitrary in the sense that it rests on no scientific evidence, only on the inventor's personal experience. It is the product of a human observing how the world evolves and trying to grasp some patterns. Such patterns can be useful. Some of them give birth to true science. The caveat is that the patterns need to undergo strict scrutiny for verification. Otherwise they suffer from the plague of machine learning: overfitting.
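 To make the RSI rule concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses plain averages of gains and losses over the last `period` price changes (a simplified variant, not the smoothed averaging of the original indicator), and the 14-day period and 30/70 thresholds are the conventional defaults, not values taken from any particular article. The function names are my own.

```python
def simple_rsi(prices, period=14):
    """RSI over the last `period` price changes, using plain averages.

    Assumption: this is a simplified variant; the classic indicator
    smooths the averages instead of taking a plain mean.
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    # Differences between consecutive prices over the last `period` steps.
    changes = [b - a for a, b in zip(prices[-period - 1:-1], prices[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        # No losses at all: treat flat prices as neutral (my convention).
        return 100.0 if avg_gain > 0 else 50.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def rsi_signal(prices, period=14, oversold=30.0, overbought=70.0):
    """Buy when RSI says oversold, sell when it says overbought."""
    rsi = simple_rsi(prices, period)
    if rsi < oversold:
        return "buy"
    if rsi > overbought:
        return "sell"
    return "hold"
```

 Note that even this tiny rule already has three knobs to turn: `period`, `oversold`, and `overbought`. Nothing in the rule itself says which values are right.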

 Seriously, these strategies resemble machine learning models fitted by hand. Instead of an algorithm optimising the outcome, trading practitioners cherry-pick the parameters to show better profits. That gives overfitting two channels to creep in.

 First, cross-validation is not part of the strategy creators' expertise. Claims that a strategy has been validated following machine learning best practices are rare, if they exist at all. What is common instead is that a hand-picked period is chosen to demonstrate the strategy.

 That leads to the second one: human psychology.

 Humans tend to believe what they want to believe, so much so that they are blind to the millions of examples where the model or strategy fails. In the case of RSI, there are plenty of them at hand.
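 Both channels can be seen in a small simulation. The sketch below is my own illustration, not anyone's published method: it generates pure-noise returns (so there is nothing real to find), tries 40 lookback values of a toy momentum rule on the first half, keeps the best one, and then runs that winner on the second half. A single hold-out split like this is a much weaker check than proper cross-validation or walk-forward testing, but it is enough to show the idea. All names and parameters are hypothetical.

```python
import random


def random_returns(n, seed):
    """Daily returns drawn from pure noise: there is no pattern to learn."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n)]


def momentum_pnl(returns, lookback):
    """Toy rule: go long tomorrow if the last `lookback` returns sum to a
    positive number, otherwise go short. Returns the summed profit."""
    pnl = 0.0
    for t in range(lookback, len(returns)):
        position = 1 if sum(returns[t - lookback:t]) > 0 else -1
        pnl += position * returns[t]
    return pnl


def pick_and_test(returns, lookbacks, split):
    """Cherry-pick the best lookback in-sample, then score it out-of-sample."""
    train, test = returns[:split], returns[split:]
    in_sample = {k: momentum_pnl(train, k) for k in lookbacks}
    best = max(in_sample, key=in_sample.get)
    return best, in_sample[best], momentum_pnl(test, best)


returns = random_returns(2000, seed=0)
best, train_pnl, test_pnl = pick_and_test(returns, range(1, 41), split=1000)
print(f"best lookback in-sample: {best}")
print(f"in-sample profit:        {train_pnl:+.4f}")
print(f"out-of-sample profit:    {test_pnl:+.4f}")
```

 The in-sample figure is the maximum over 40 attempts, so it is flattering by construction. Because the data is noise, there is no reason to expect the out-of-sample figure to match it, and showing only the first number is exactly the "selected period" demonstration described above.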

 While overfitting trips up those who are not machine learning professionals, there is a subtler and more profound flaw that shakes the foundation of using historical prices to predict future prices.

 As hidden as it is, it's actually as obvious as the elephant in the room.

 The price is what the model is supposed to output, but at the same time it is the input too. This setup falls into the pitfall of "correlation does not imply causation": past and future prices are, at best, correlated.

 Ideally, a predictive machine learning model should capture causation rather than mere correlation.

 Causation is a true dependence between two entities: when one changes, the other must adjust accordingly, at least to some extent.

 Correlation, on the other hand, is similar behaviour shared between entities. They might depend on a common cause, or the similarity might be simple coincidence. They are therefore susceptible to influences that act on one but not the other, and so the correlation can break down at any time.

 An illustrative example is the correlation between ice cream sales and bushfires. The two have been shown to be correlated (sometimes). It would be ridiculous to argue that shutting down ice cream businesses can help put out bushfires, because bushfires have no obligation to your favourite dessert, i.e. there is no causation here. It makes more sense to theorise that the common driver is temperature. The higher the temperature, the higher the ice cream profits, and the higher the risk of a lurking bushfire.

 Interestingly, humidity might be able to break this correlation. It should mitigate the danger of bushfires without keeping people away from the ice cream stands. When it is humid, higher temperatures may not bring bushfires, but ice cream may still sell out.
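 The ice cream and bushfire story can be simulated in a few lines. This is a toy sketch with made-up numbers: temperature drives both quantities in a dry regime, while in a humid regime I simply assume temperature stops driving bushfire risk at all. The coefficients, noise levels, and function names are arbitrary assumptions chosen for illustration, not measurements.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n, humid, seed):
    """Return (ice_cream_sales, bushfire_risk) for n simulated days.

    Assumption: temperature always drives ice cream sales, but it only
    drives bushfire risk when the weather is dry.
    """
    rng = random.Random(seed)
    ice_cream, bushfire = [], []
    for _ in range(n):
        temperature = rng.uniform(15.0, 40.0)
        ice_cream.append(2.0 * temperature + rng.gauss(0.0, 5.0))
        fire_driver = 0.0 if humid else temperature
        bushfire.append(fire_driver + rng.gauss(0.0, 3.0))
    return ice_cream, bushfire


dry_corr = pearson(*simulate(2000, humid=False, seed=1))
humid_corr = pearson(*simulate(2000, humid=True, seed=2))
print(f"dry regime correlation:   {dry_corr:.2f}")
print(f"humid regime correlation: {humid_corr:.2f}")
```

 Ice cream sales never appear in the bushfire formula, yet the two are strongly correlated in the dry regime because they share a driver. Switch the regime and the correlation collapses, even though nothing about ice cream has changed.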

 Back to price prediction. The previous and current price levels are analogous to ice cream and bushfires (although both are prices, just at different times). Optimistically, there is a correlation between the two. Fundamentally, however, both are dictated by the workings of the economy, politics, and even public health crises, as is the case at the time of writing. When two periods share similar fundamentals, price levels can behave in a similar fashion (correlated). Nevertheless, this deceptive correlation can leave you broke when a new factor unexpectedly switches on in a new period, such as the outbreak of a pandemic.

 In conclusion, a machine learning model that predicts future prices from historical prices is not a reliable way to beat the market, because it is merely riding a wave that can come to a halt at any moment. So what are your thoughts? If you are with me, then save your time for other interesting projects. In any case, you are welcome to ask me any questions.