Team Data Science Process
I have written on this blog before about the need to standardise data science and its processes. Microsoft has made progress on that front by releasing a methodology called TDSP (Team Data Science Process), an attempt to formalise the way that data scientists work and collaborate.
To further support this methodology, Microsoft recently released two very valuable tools for TDSP, which you can find at this repository. The Modeling tool automates the testing of different algorithms, and IDEAR helps with data exploration and reporting.
Both tools are written in R. The modelling tool uses YAML to specify experiments, which makes it pretty convenient. The cases covered so far are binary classification and regression, but I would expect more to be added in the future. From the TDSP documentation, it is clear that Microsoft wants to create something like Agile, but for data science.
While I am not suggesting that everyone should follow the methodology created by Microsoft, I do believe it is a step in the right direction. I expect other methodologies to show up in the near future.
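To make the idea of a declarative experiment specification concrete, here is a minimal sketch in Python. It is not the TDSP modelling tool and does not use its YAML format: the keys in `experiment`, the candidate names, and the toy data are all hypothetical, chosen only to show how a small spec can drive an automated comparison of algorithms on a regression task.

```python
import math

# Hypothetical experiment spec. A YAML file could hold the same structure;
# none of these keys are taken from the TDSP tools.
experiment = {
    "task": "regression",
    "metric": "rmse",
    "candidates": ["mean_baseline", "linear_fit"],
}


def fit_mean_baseline(xs, ys):
    """Return a model that always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Return a one-variable least-squares line fitted to the training data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Maps the names used in the spec to fitting functions.
FITTERS = {"mean_baseline": fit_mean_baseline, "linear_fit": fit_linear}


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))


def run_experiment(spec, train, test):
    """Fit every candidate named in the spec and score it on the test set."""
    results = {}
    for name in spec["candidates"]:
        model = FITTERS[name](*train)
        results[name] = rmse(model, *test)
    return results


# Toy data, invented for illustration only.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
test = ([5.0, 6.0], [10.1, 11.9])

results = run_experiment(experiment, train, test)
best = min(results, key=results.get)
print(results, "best:", best)
```

The point of the design is that adding another algorithm means registering one more function and one more name in the spec, while the evaluation loop stays unchanged.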
The importance of data science processes
I believe that establishing data science processes is an important step the field needs to take in order to speed up adoption in more industries. Right now the field is very fragmented, largely because data science is a combination of many disciplines: statistics, machine learning, AI and other fields (such as data mining). I have written in the past about the differences between AI and machine learning, and between machine learning and statistics. There is fragmentation in terms of training (how one becomes a data scientist), methods, and languages (e.g. R vs Python).
There have been some efforts in the past to unify languages through PMML (Predictive Model Markup Language), but it was never widely adopted. The work done by Microsoft is a step in the right direction because it provides a more abstract framework for approaching data science problems. I don’t believe unification of tools will ever happen, since so many software libraries come out every year. However, the overall process through which we define problems and collaborate to solve them can be unified, and that could help the adoption of best practices in analytics across all industries.