14 June 2020

A Million Dollar Life Wisdom

In this short blog post, I would like to share with you a million-dollar piece of life wisdom of mine. You can thank me for my generosity later.


Before I tell you the wisdom, you should know a little of my story from before it was discovered, which will help you understand where it comes from and why it is valuable.


A few years back, having left academia, I set out on the journey to my first job as a data scientist. I was like a human sponge, thirsty for new knowledge. My eagerness, however, was largely driven by my desire to get things done as quickly as possible with my newly acquired knowledge. Little did I know, or little was I willing to admit, that I was expecting a pie in the sky. What stood in the way was my lack of hands-on experience. I often got frustrated by errors when running scripts, which mercilessly destroyed my confidence in myself.


Until one day.


On that day, I solved a problem that had been blocking my study of deep learning for months. I won’t go into the details of the problem and the solution here. The moment I smashed it, the joy of course ran through my body and relaxed all my muscles. Then this priceless wisdom was born out of a thirty-second reflection.


Although in the past few months I couldn’t do what I had planned to do, in order to solve the problem I had tried to view it from various angles, looking for a weak link to break through. Each new view or idea involved massive Google searching for verification or exploration, which turns out to be the best way of learning: trial and error. Looking back, I didn’t just solve the problem. I had also picked up a load of knowledge in the same or related fields without even realising it, which is indeed a blessing in disguise.


So here is the wisdom I promised at the beginning: learning from problems makes truly perfect.


Embrace the problems you encounter and learn from them. With this in mind, you will notice how much more knowledge you actually gain in the process of making sense of a problem and finding a solution to it. An easy run without any hiccups is, counterintuitive as it may be to most people, a big loss of opportunity on the way to becoming perfect.


Since that unforgettable day, though I cannot say I never get frustrated by agonising problems any more, most of the time I wear a genuinely relaxed smile and feel excitement inside, because I know I have been given another opportunity to become better.


So, my dear friends, criticise me. That's a big help to me, and I thank you for it!

04 April 2020

My Take on Beating the Market with Machine Learning Models - an Elephant in the Room

If you google things like "machine learning model predicting stock price", you will find tons of articles lecturing on how it can be done. The reality is that no individual data scientist has become rich from these models, which might set off the alarm that this is too good to be true.

My intention here is not to alienate myself by labelling these authors as scammers, but to discuss how machine learning works, so that hopefully everyone can learn something or raise more interesting questions, which without any doubt benefits the whole community. After all, those authors' work on analysis, visualisation, and modelling is well appreciated.

First off, the strategies. Bear with me and you will see this is actually relevant. There are thousands of them out there, if not millions, and new ones keep popping up every day. A simple example would be: buy when RSI indicates oversold and sell when it indicates overbought. The invention of RSI was grounded not in scientific evidence but in the inventor's personal experience. It is the process of a human observing the evolution of the world and trying to grasp some patterns. The patterns can be useful, and some of them give birth to true science. A caveat, though, is that the patterns need to undergo strict scrutiny for verification. Otherwise they suffer from the plague of machine learning: overfitting.

In all seriousness, these strategies resemble machine learning models fitted manually by humans. Instead of an algorithm optimising the outcome, trading practitioners cherry-pick the parameters that yield better profits. This gives overfitting two channels to creep in.

First, cross-validation, essential for any model, is rarely part of the strategy creators' expertise. No claims are made that the strategies have been validated following machine learning best practices. What is common instead is that a carefully selected period is chosen to demonstrate the strategy.

 That leads to the second one: human psychology.

Humans tend to believe what they want to believe, so much so that they are blind to the millions of examples where the model or strategy fails. In the case of RSI, there are plenty of them at our disposal.
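
Both channels are easy to reproduce in a few lines. Below is a minimal sketch on a synthetic random walk which, by construction, has nothing to predict: a toy mean-reversion rule is tuned by trying many parameter pairs on one period, and the cherry-picked winner is then evaluated on a held-out period. The price series, the rule, and the profit measure are all invented purely for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A synthetic price series: a pure random walk, so no rule can genuinely beat it.
prices = pd.Series(200 + rng.normal(0, 1, 2000).cumsum())
returns = prices.pct_change().fillna(0)

def strategy_return(returns, lookback, threshold):
    # Toy mean-reversion rule: go long the day after the last `lookback` days
    # have fallen by more than `threshold` in total. Purely illustrative.
    rolling = returns.rolling(lookback).sum()
    position = (rolling < -threshold).shift(1, fill_value=False)
    return returns[position].sum()

train, test = returns.iloc[:1000], returns.iloc[1000:]

# Cherry-picking: try many parameter pairs and keep the one that looks best in-sample.
grid = [(lb, th) for lb in (5, 10, 20) for th in (0.005, 0.01, 0.02)]
best = max(grid, key=lambda params: strategy_return(train, *params))

print("best parameters on the training period:", best)
print("return on the training period:", round(strategy_return(train, *best), 3))
print("return on the held-out period:", round(strategy_return(test, *best), 3))
# The in-sample winner tends to look fine on the data it was picked on
# and unremarkable on the held-out period: overfitting in action.

The point is not the specific numbers but the habit: any parameter chosen by looking at the data must be judged on data it has never seen.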

While overfitting trips up non machine learning professionals, there is a subtler, more profound flaw that shakes the foundation of the whole practice of using historical prices to predict future prices.

 As hidden as it is, it's actually as obvious as the elephant in the room.

The price is what the model is supposed to output, but at the same time it is the input too. This setup falls into the pitfall of "correlation does not equal causation", because past and current prices are, as a matter of fact, at best correlated.

The purpose of a predictive machine learning model is to mathematically capture causation rather than correlation.

Causation is a true dependence between two entities: when one changes, the other must, to some extent, adjust itself accordingly.

Correlation, on the other hand, is similar behaviour shared between entities. They might depend on the same cause, or it might simply be coincidence. They are therefore susceptible to influences that act on one but not the others, so the correlation can break down at any given time.

An illustrative example is the correlation between ice cream sales and bushfires. It has been shown that the two are (sometimes) correlated. It would be ridiculous to argue that shutting down the ice cream business could help put out bushfires, because bushfires feel no obligation towards your favourite dessert; there is no causation here. It makes more sense to theorise that the common driver is temperature: the higher the temperature, the higher the ice cream profits, and so is the risk of a lurking bushfire.

Interestingly, humidity might be able to break this correlation. It should mitigate the danger of bushfire but not stop people from going to the ice cream stands. When it is humid, higher temperatures may not bring bushfires, yet ice cream may still sell out.
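
To put rough numbers on this picture, here is a small simulation with entirely invented figures: ice cream sales and a bushfire-risk index are both driven by temperature, so they correlate, while humidity acts on the bushfire side only and weakens that correlation, even though neither series ever causes the other.

import numpy as np

rng = np.random.default_rng(42)
n = 365

temperature = rng.normal(25, 8, n)   # shared driver (made-up degrees Celsius)
humidity = rng.uniform(0, 1, n)      # acts on bushfire risk only

ice_cream_sales = 50 + 3 * temperature + rng.normal(0, 10, n)
bushfire_risk_dry = 0.5 * temperature + rng.normal(0, 3, n)                     # no humidity effect
bushfire_risk_humid = 0.5 * temperature - 25 * humidity + rng.normal(0, 3, n)   # humidity dampens risk

print(np.corrcoef(ice_cream_sales, bushfire_risk_dry)[0, 1])    # fairly strong correlation
print(np.corrcoef(ice_cream_sales, bushfire_risk_humid)[0, 1])  # noticeably weaker
# Neither series causes the other; the link lives entirely in temperature,
# and a factor acting on only one side can weaken it at any time.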

Back to price prediction. The relationship between previous and current price levels is comparable to that between ice cream and bushfire (even though both are prices, just at different times). Optimistically, there is correlation between the two. Fundamentally, however, both are dictated by the state of the economy, politics, and even public health crises, as is painfully clear at the time of writing. Given that two periods share similar fundamentals, the price levels may behave in a similar (correlated) fashion. Nevertheless, this deceptive correlation can break you when an attribute switches on unexpectedly in the new period, such as the outbreak of a pandemic.

In conclusion, a machine learning model predicting future prices from historical prices cannot be relied upon to beat the market, because it is merely a model catching a wave that can come to a halt at any moment. So what are your thoughts? If you are with me, save your time for other interesting projects. In any case, you are welcome to ask me any questions.



29 March 2020

A Plea to All Still Alive

Trapped at home. Hair looking like a bird's nest. Hands never as clean as they are now.

Outside is a world sobbing at every corner. Lost mothers and fathers, daughters and sons, sisters and brothers.

Millions of employees have lost their jobs, feeling angry and helpless. Migrant workers are stranded on the street under lockdown, facing starvation, with home too far away to reach.

The no-brainer origin of all this tragedy and turmoil has been pushed by its motherland to find a new home. Anywhere is fine.

Hypothetically, the origin could be anywhere else. Yet that doesn’t prove the site of the first outbreak is innocent.

Everyone’s life has been turned upside down. Average people like you and me are feeling the pain. Yet wrongdoers are adding insult to injury.

We are progressing, though. Deniers morph into cowards. “Just another flu.” “The mortality rate is probably 1-2%.” As if a mortality rate 10-20 times that of the ordinary flu, on par with the bloodcurdling Spanish Flu, were no big deal.

Not at all.

I see.

Until reality kicks in.

It’s bittersweet to watch the “fact advocates” find a new enthusiasm in criticising authorities unwilling to implement draconian measures, leaving their old one, playing with science like a puppet, behind.

The virus doesn’t have a brain to care whether you are royal, rich, or a self-avowed genius with fox-like hair. If you fail science, science fails you, without any emotion.

Respect science.

16 February 2020

How to Prepare for the AWS Solutions Architect - Associate Exam

On 24 January 2020, I was excited to collect the trophy I had been dreaming about: the AWS Solutions Architect - Associate certificate. While the memory is still fresh, I would like to share the story of how I achieved it. My first-hand experience, I hope, will help aspiring learners get certified.

Why I took the exam


The past couple of years have been a blessing in disguise for me. I was made redundant at work and had personal difficulties, which explains my inactivity in this space. At the end of the day, I landed a job at a great organisation where I am able to add data engineering to my career portfolio (check out why I wanted to do that: https://doctor-fei.blogspot.com/2020/02/why-i-learned-data-engineering-as-from.html). The organisation encourages and supports constant learning among its employees and, as an AWS partner, values AWS certifications probably above anything else.

In such an environment, it is not getting certified, rather than getting certified, that needs good reasons, all thanks to healthy peer pressure. Hence my journey began.

Be clear about what the exam expects from you


At first glance, AWS services are undoubtedly daunting: by 2020 there were 212 of them available. No one can learn everything, so a clear goal is the solid stepping stone to success. Make sure you know what you are expected to learn for this exam. The exam guide is surely your good friend.

Have a good teacher


While knowing what to learn is good, a good teacher can make the learning experience enjoyable and painless. I’m lucky that at work I have free access to a collection of Udemy courses, among which I find the prep course by Stephane Maarek outstandingly useful. It precisely covers all you need to know and also teaches practical knowledge as well as best practices. The course includes two practice tests, which I personally find the closest to the official ones. The importance of practice tests will be discussed shortly.

Learn efficiently


When it comes to your own due diligence, a smart learner absorbs knowledge efficiently. Avoid merely following the episodes of the online course and hoping for the best. Here are my two cents.

Understand the business problems and how cloud services solve them


Each of the services is developed to solve business problems. Pay close attention to the scenarios a service fits. A better understanding can dramatically speed up your response to exam questions. A typical question in the real exam nearly always starts with a problem: “You work for a start-up. They plan to leverage cloud resources to host their company’s website without long-term commitment. What is the best EC2 option for their purpose?” Catching the business-related keywords “without long-term commitment” enables you to pick the correct option quickly.

Hands-on practice


Without hands-on practice, even the best-memorised facts slip from your mind faster than you think. I suggest you open your own AWS account and follow all the demos in the course. For one reason or another you will encounter, at least in my experience, unexpected errors. Don’t get frustrated when this happens. On the contrary, I urge you to love and embrace them. To solve an issue you have to google it and figure out what it means, and in doing so you will most likely read AWS documentation and Stack Overflow discussions, thereby learning the nitty-gritty of the subject. Additionally, the benefit of this troubleshooting experience goes beyond the exam: the errors will probably appear again in your real work, and you will already know the solution.
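
If you would like a tiny first hands-on step outside the console, a few lines of boto3 (the AWS SDK for Python) already exercise credentials, regions, and IAM permissions, which in my experience is exactly where the first errors come from. This is only a sketch: it assumes you have already configured credentials, for example with aws configure, and the region is an arbitrary choice.

import boto3

# Assumes credentials are already configured (e.g. via `aws configure`).
session = boto3.Session(region_name="ap-southeast-2")  # pick your own region

# List the S3 buckets in the account.
s3 = session.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# List the EC2 instances in the chosen region and their states.
ec2 = session.client("ec2")
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])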

Learn repeatedly


Finally, practice tests boost your performance in the exam. Time all your practice tests and learn the pace that works best for you. At the time of writing, there are 65 questions to answer within 130 minutes, i.e. two minutes per question. After a few self-tests, I found myself spending about a minute on each question. In the real exam I had a slow start, but I didn’t panic because I knew that, statistically, I would use less than 130 minutes to answer all the questions and everything would be OK. Practice tests also reveal your weaknesses. Treat each question you get wrong like an error in the hands-on practice: dive deeper into the AWS documentation or watch the course video again. Repeating the hands-on exercise, when there is one, is best.

Conclusion


That’s all I have to share. My theory is that the general rules here aren’t only helpful for AWS exams but for any exam. I hope you like it. Please ask any questions, leave your comments, or share your exam success below.

02 February 2020

Why I Learned Data Engineering as a Data Scientist




Based on a true personal story


Disclaimer: The opinions in this article are limited by the scope of my personal experiences. Please do NOT take it as the only advice for planning your future career.

The first job I got after leaving academia was as a data scientist. I loved the opportunity to crunch numbers as a daily activity. But later I realised I had to acquire experience in data engineering. Here’s my true story.

The first company I worked for was, at that time, a small and fast-growing consulting firm. I was the only data scientist there. The company was doing well, landing contracts with renowned Australian brands. The projects mostly involved taking data from the clients’ data warehouses or data marts (occasionally from source databases, which is crazy), building customer views, and setting up online, mainly email, marketing campaigns.

My role was supposed to spice up the company’s products with artificial intelligence. In a couple of projects I developed customer clustering models. They group customers into natural clusters based on facts including demographics, but focus primarily on transactional interactions with the brand, for instance purchase frequencies and volumes. The patterns the algorithm learns from the data inform the clients about how their customers behave and help them tailor the messages used in a campaign to different cohorts.
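
For readers curious what such a model looks like in code, here is a minimal sketch with scikit-learn on invented RFM-style features (recency, frequency, monetary value). The real projects used client data and richer features, so treat this purely as an illustration of the shape of the work.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Invented transactional features per customer: recency in days,
# purchase frequency per year, and total spend in dollars.
customers = pd.DataFrame({
    "recency": rng.exponential(60, 500),
    "frequency": rng.poisson(8, 500),
    "monetary": rng.gamma(2, 150, 500),
})

# Scale the features so no single unit dominates the distance measure.
X = StandardScaler().fit_transform(customers)

# Group the customers into a handful of natural clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
customers["cluster"] = kmeans.labels_

# The average profile per cluster is what ends up informing the campaign.
print(customers.groupby("cluster").mean().round(1))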

Another type of model useful in marketing campaigns that I built is churn prediction. Knowing how likely a customer is to leave gives the business an advantage: it can offer promotions or discounts to retain the customers at risk.
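
Again purely as an illustration, with every number invented, a churn model of this kind can be as simple as a logistic regression turning a few behavioural features into a probability of leaving.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
n = 2000

# Invented behavioural features and a churn label loosely tied to them.
data = pd.DataFrame({
    "days_since_last_purchase": rng.exponential(45, n),
    "purchases_last_year": rng.poisson(6, n),
    "complaints": rng.poisson(0.3, n),
})
logit = (0.02 * data["days_since_last_purchase"]
         - 0.3 * data["purchases_last_year"]
         + 1.2 * data["complaints"] - 0.5)
data["churned"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="churned"), data["churned"], test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_probability = model.predict_proba(X_test)[:, 1]  # feed this into retention offers
print("AUC:", round(roc_auc_score(y_test, churn_probability), 3))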

That all sounds interesting, so what is the problem?

Statisticians always warn us: “Garbage in, garbage out.” The quality of the data asset is vital to a data science project. Interestingly, but probably not surprisingly, I found that management’s attitude to the value of data science models resonates with the maturity of the data they own.

On some occasions the model was built and deployed into production quite smoothly. On many others, it was built but we never heard from the client about deployment. It also happened that the data was too poor to dream of any valid model.

Similar things happened in my second job. At one point the stakeholders were interested in a model predicting online traffic volume. Sadly, it never turned up on the company’s roadmap.

My story might be discouraging for those aspiring to be data scientists, but it happens for understandable reasons. In my opinion, the root cause is that data science modelling, as a newcomer, sits at the end of the data processing pipeline. Normally a pipeline starts by reading data from source systems, transforms it, and stores it in a data warehouse to serve reporting views or data marts, and perhaps the machine learning models.

The consequences of this topology are easy to imagine. Developing a working model draws the least attention during planning meetings, even though people might talk about it a lot when brainstorming for a project. For many projects in this country, as I observe, a machine learning application is nice to have but not essential. Presumably, this is largely influenced by today’s decision-makers, who received their education when informative dashboards reporting business performance were the universe. The take-off of algorithm-based models in business will need patience and time.

While I was struggling to prove my value as a data scientist, another role was too busy to argue about its importance: the sometimes behind-the-scenes heroes who build the pipelines, the backbone of any project, namely the data engineers. This literally happened in both of my jobs. My data engineer colleagues at the consulting firm were involved in every project. How my second job ended is an even better example to prove my point: I was made redundant after an internal restructuring. A few months later even my boss didn’t survive the changes, but the data engineer mate in my team kept his job safe until he quit for another job that interested him more.

Hopefully you find my story interesting and useful for your career considerations. I still strongly believe that data scientist is the sexiest job of the 21st century, and I never stop acquiring knowledge for it. However, until this career fully pans out, having data engineering skills and experience helps you secure a job in the industry built around data.

Please let me know if my story resonates with you or if you disagree with my opinions. Any criticism is welcome. Leave your comments below. Peace.

24 February 2018

How to Drop only Local Duplicates

drop_duplicates_while_keeping_order

This notebook is available on GitHub.

Context

The problem encountered was more or less a customer journey. A customer may first do A, then B, then B again, then C three times, then B again twice: a series like ABBCCCBB. The goal is to remove the duplicates found in neighbouring events; if there is another element between two identical elements, they are not considered duplicates. In our example above, we want the final result to be ABCB. We try to achieve this with the drop_duplicates method of a Pandas data frame. In this blog post, I would like to share the frustration and the lessons I learned from solving this problem.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

Problem

Unfortunately, there isn't a built-in function in Pandas that does exactly this. Let's first see what we get with the built-in drop_duplicates.

In [2]:
dat = pd.DataFrame({'event': list('ABBCCCBB')})
dat
Out[2]:
event
0 A
1 B
2 B
3 C
4 C
5 C
6 B
7 B
In [3]:
dat.drop_duplicates()
Out[3]:
event
0 A
1 B
3 C

Solution in a Simple Case

The method correctly counts that there are only three unique elements. However, it drops the trailing B's that we actually want to keep. One vital observation leading to the solution is that if a cell differs from the cell above, it should be kept. So let's shift the data frame downwards so that each row holds both a cell and the cell above it.

In [4]:
dat.loc[:, 'event_shifted'] = dat.event.shift()
dat.loc[:, 'is_different'] = dat.event != dat.event_shifted
dat
Out[4]:
event event_shifted is_different
0 A NaN True
1 B A True
2 B B False
3 C B True
4 C C False
5 C C False
6 B C True
7 B B False

Therefore, if I keep only the rows where is_different is true, the problem is solved.

In [5]:
dat.loc[dat.is_different, ['event']]
Out[5]:
event
0 A
1 B
3 C
6 B

In retrospect, this makes total sense, as the scope of our definition of "duplicates" is limited to the row above rather than the whole column, as drop_duplicates assumes.

A Little More Complex Case

While this simple solution works like magic, what about the situation with multiple columns? For example, instead of one customer we now have two. Running the same solution will occasionally err, as in the example below.

In [6]:
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'), 'customer_id': [1]*8 + [2]*7})
dat
Out[6]:
customer_id event
0 1 A
1 1 B
2 1 B
3 1 C
4 1 C
5 1 C
6 1 B
7 1 B
8 2 B
9 2 B
10 2 C
11 2 C
12 2 C
13 2 B
14 2 B
In [7]:
dat.loc[:, 'event_shifted'] = dat.event.shift()
dat.loc[:, 'is_different'] = dat.event != dat.event_shifted
dat.loc[dat.is_different, ['customer_id', 'event']]
Out[7]:
customer_id event
0 1 A
1 1 B
3 1 C
6 1 B
10 2 C
13 2 B

The first event of customer 2 was removed because it is the same as the last event of customer 1. Therefore the customer id should also be compared.

In [8]:
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'), 'customer_id': [1]*8 + [2]*7})
shifted = dat.shift()
is_different = (dat.customer_id != shifted.customer_id) | (dat.event != shifted.event)
dat.loc[is_different]
Out[8]:
customer_id event
0 1 A
1 1 B
3 1 C
6 1 B
8 2 B
10 2 C
13 2 B

Now we have the correct final data set.

What We Learned

We found a simple solution for dropping duplicates only across neighbouring rows, implemented entirely with Pandas built-in methods and functions. There is no iteration through the rows, which means it is fast.
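
If you need this more than once, the whole trick fits into a small reusable helper. The function name and signature below are my own; it simply wraps the shift-and-compare logic above and works for any set of columns.

import pandas as pd

def drop_local_duplicates(df, cols):
    # Keep a row only if it differs from the row immediately above
    # in at least one of the given columns.
    shifted = df[cols].shift()
    is_different = (df[cols] != shifted).any(axis=1)
    return df.loc[is_different]

# The same example as above: the first event of customer 2 survives.
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'), 'customer_id': [1]*8 + [2]*7})
print(drop_local_duplicates(dat, ['customer_id', 'event']))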

30 July 2017

Count the Number of Customers at Certain Time Points with merge_asof

This notebook is available on GitHub.

This will probably look like a trivial example, but the message I would like to deliver is that merge_asof is an awesome tool for merging time series datasets.

It's fairly new, added in version 0.19.0 if I remember correctly.

First off, let's create some dummy data.

In [1]:
import pandas as pd
In [2]:
signup = pd.DataFrame(pd.date_range('2016-01-01', '2017-01-01', freq='M'), columns=['signup-date'])
In [3]:
signup
Out[3]:
signup-date
0 2016-01-31
1 2016-02-29
2 2016-03-31
3 2016-04-30
4 2016-05-31
5 2016-06-30
6 2016-07-31
7 2016-08-31
8 2016-09-30
9 2016-10-31
10 2016-11-30
11 2016-12-31

Now your task is to find how many customers you had at each of the following time points.

In [4]:
check_date = pd.DataFrame(
    [pd.Timestamp(2016, 4, 17), pd.Timestamp(2016, 5, 15), pd.Timestamp(2016, 6, 10)],
    columns=['check-date']
)
check_date
Out[4]:
check-date
0 2016-04-17
1 2016-05-15
2 2016-06-10

With merge_asof, the joining keys don't have to be equal. It is a left join by definition. With the default settings, each row is joined to the row in the right dataframe whose joining key has the largest value that is still no greater than the value in the left dataframe.

Hope you can wrap your head around what I just said. Fortunately, it is easy to explain with an example.

Let's merge check_date and signup and focus on the first row having date '2016-04-17'.

When the merge happens, it looks through all the dates in signup and eventually picks '2016-03-31', because '2016-03-31' is the last date in signup earlier than (smaller than) '2016-04-17'.

Have a look at the merging results.

In [5]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
Out[5]:
check-date signup-date
0 2016-04-17 2016-03-31
1 2016-05-15 2016-04-30
2 2016-06-10 2016-05-31

With this in mind, we only need one more column showing the cumulative number of customers up to and including the sign-up in each row.

In [6]:
signup.loc[:, 'count'] = list(range(1, len(signup)+1))
signup
Out[6]:
signup-date count
0 2016-01-31 1
1 2016-02-29 2
2 2016-03-31 3
3 2016-04-30 4
4 2016-05-31 5
5 2016-06-30 6
6 2016-07-31 7
7 2016-08-31 8
8 2016-09-30 9
9 2016-10-31 10
10 2016-11-30 11
11 2016-12-31 12

Then merge_asof gives us the count at each requested date.

In [7]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
Out[7]:
check-date signup-date count
0 2016-04-17 2016-03-31 3
1 2016-05-15 2016-04-30 4
2 2016-06-10 2016-05-31 5

There are still more options with which you can tune the behaviour of merge_asof to fit your merging goal. For example, you can set a tolerance so that a match is only accepted within a given range, like "within 10 days".

In [8]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date', tolerance=pd.Timedelta('10days'))
Out[8]:
check-date signup-date count
0 2016-04-17 NaT NaN
1 2016-05-15 NaT NaN
2 2016-06-10 2016-05-31 5.0

Can you see that the first two rows couldn't find any sign-up date within 10 days backwards?
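
Another option worth knowing about, assuming your pandas version is at least 0.20, is direction, which controls whether merge_asof searches backwards (the default), forwards, or for the nearest key.

pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date', direction='forward')

This should pair each check date with the first month end on or after it, giving counts of 4, 5, and 6 instead of 3, 4, and 5.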

Now it's your turn to explore the usefulness of this tool!