Testing ML Algorithms
. keep data preparation separate from the algorithm. It makes testing easier, since each part can be checked on its own.
. check that outputs are persisted correctly. (I use Spark at work. Weird things can happen during a Spark run, especially when the job runs in parallel.)
. log every step. That makes it much easier to see which step has issues and which step takes the longest.
. test with a benchmark dataset, and add more cases as you continue to develop your algorithm. When you develop a supervised algorithm,
it is good to have labeled data and test the algorithm’s efficacy on each segment of the data.
If you are testing an unsupervised algorithm, you should at least monitor some efficacy metrics, or produce outputs in a form that makes the quality of the algorithm easy to judge.
(For example, if you are testing a clustering algorithm, you can create a benchmark dataset with some pre-made clusters that you know, along with some noise.)
. test with increasingly large training datasets to see how efficacy evolves.
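
A minimal sketch of the clustering benchmark idea above, in plain Python rather than Spark. Everything named here is made up for illustration: make_benchmark, cluster, score, run_benchmark, the cluster centers, and the eps / min_pts values. The small DBSCAN-style clusterer is only a stand-in for whatever algorithm you are actually testing. The sketch keeps data preparation separate from the algorithm, logs the timing of each run, and repeats the test on growing datasets.

```python
import logging
import math
import random
import time
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("benchmark")

# Hypothetical benchmark layout: three known, well-separated cluster centers.
CENTERS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]


def make_benchmark(n_per_cluster, n_noise, seed=0):
    """Data preparation only: known clusters plus uniform noise (true label -1)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(CENTERS):
        for _ in range(n_per_cluster):
            # Each cluster is a unit square around its center.
            points.append((cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)))
            labels.append(label)
    for _ in range(n_noise):
        points.append((rng.uniform(-5.0, 15.0), rng.uniform(-5.0, 15.0)))
        labels.append(-1)
    return points, labels


def cluster(points, eps=1.5, min_pts=5):
    """Stand-in for the algorithm under test: a small DBSCAN-style clusterer.

    Returns one label per point; -1 means the point was treated as noise.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if len(neighbors[q]) >= min_pts:
                        stack.append(q)  # only core points keep expanding
        cluster_id += 1
    return labels


def score(true_labels, pred_labels):
    """Return (purity over true-cluster points, share of true noise flagged as noise)."""
    members = {}
    for t, p in zip(true_labels, pred_labels):
        if p != -1:
            members.setdefault(p, Counter())[t] += 1
    # Majority true label of each predicted cluster.
    majority = {p: c.most_common(1)[0][0] for p, c in members.items()}
    clustered = [(t, p) for t, p in zip(true_labels, pred_labels) if t != -1]
    noise = [p for t, p in zip(true_labels, pred_labels) if t == -1]
    purity = sum(majority.get(p) == t for t, p in clustered) / len(clustered)
    noise_recall = sum(p == -1 for p in noise) / len(noise) if noise else 1.0
    return purity, noise_recall


def run_benchmark(sizes=(20, 50, 100), n_noise=10):
    """Run the same test on growing datasets and log how the metrics evolve."""
    results = []
    for n in sizes:
        points, truth = make_benchmark(n, n_noise)
        start = time.perf_counter()
        pred = cluster(points)
        elapsed = time.perf_counter() - start
        purity, noise_recall = score(truth, pred)
        log.info("n_per_cluster=%d purity=%.3f noise_recall=%.3f cluster_time=%.3fs",
                 n, purity, noise_recall, elapsed)
        results.append((n, purity, noise_recall))
    return results


if __name__ == "__main__":
    run_benchmark()
```

Because make_benchmark knows the true label of every point, the same harness can be pointed at a different clustering function by swapping out cluster, and new benchmark cases can be added without touching the algorithm.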