Testing: Because we can’t spell BUGS without U 🐞


Testing is a way of quality assuring our work when creating software that is important. Quality means a lot of things and for the scope of this post I will mean reliability mainly. Martin Fowler has a great article on why high quality software is worth the effort.

fiiiire

So in this post I wanted to talk a little bit about the content of a talk I did recently which was about…testing. During the day of the talk after discussing the opinions and pain points of other people in goverment I have since updated it a bit in order to reflect certain points that were discussed.

Testing is an essential part for work that fits the Reproducible Analytical Pipelines (RAP) concept. For more information about RAP , a great cross-government initiative, please have a look below:

So… lets get started with some definitions of testing so as to speak the same language:


Unit testing


Test your functions with specific inputs and expect specific outputs. Start with your testing here. 😊 Unit tests are low level,focusing on testing individual methods and functions used by your scripts.

There are very good testing frameworks that make the creation of tests convenient and not a drag. Pytest is recommended for Python (and testthat for R). There are also many Pytest plugins that add new functionality such as coverage calculation, parallelizing tests,including time information in tests (ie fail a test if it takes more than 1 min to finish!) etc


Unit testing (automated with parameterization)


Same as above but include more than one set of inputs and outputs. The more the merrier!!! Parametrized tests will be called multiple times, each time executing the set of dependent tests. Test functions usually do not need to be aware of their re-running.


Integration testing (less tests that unit tests)


Integration tests verify that different functions or pipeline steps used by your application work well together. So it is testing performed to expose defects in the interfaces and in the interactions between pipeline steps. Its important to do this for larger pipelines. Sometimes especially when different people create different steps of the pipeline and perhaps specification information is lost in translation (great movie by the way), this is a way to ensure .. integration!


End to End Testing (even less tests but to test whole pipeline)


There may still be emergent issues when the whole pipeline is used together hence E2E testing is sometimes needed.


… But will I need all this testing as a data scientist? I’m not creating enterprise software


You definately need to unit test the functions you are creating. You dont need to test everything under the sun. Just your functions. Do lots of unit tests. 😊

When you are using loads of different functionality from different packages/different systems/different steps you can test this and you could think of it as an integration test. Dont do a million of these. Just enough to show that what you think should happen should happen from one function to another.

Also in a pipeline perhaps you could test successive possible steps as an integration test.

Finally it would be nice to have a few end-to-end tests from the beggining to the end of a pipeline so you can perhaps catch things apparent only then!

Chris Wallace , a data scientist working for the Cloudera Fast Forwards Labs has a nice post on why tests are needed for data science work from his own point of view. You might find it interesting and its quite relevant so click here to read it.

Language types

Certain data science languages like R or Python are not staticly-typed languages. This makes them sometimes easier but it does place the onus on the programmer to conduct more tests to ensure bugs are not introduced.What I am trying to say here is that in some other languages certain errors (related to wrong types) would not compile at all.

In Python and R there might be a chance to do so and this means that you rely more on your tests and less on the interpreter to make sure your code is correct.On the other hand Python and R are easier to use!

So… you win on the swings and lose on the roundabouts 🎡.

There is this discussion that is quite interesting about types. Also this blog post which describes many languages and where they stand as far as the typing classification goes.Perhaps it will be of interest.

PS. This is not a Python vs R vs $insertTrendyLanguge point. Both R/Python are great languages and help us perform a lot of great analytical work.It just helps to be aware of some of their weaker points so as to be prepared to deal with them.

Things that are not essential but are ‘Nice to Have’ :


Continuous Integration

Every time someone tries to integrate / add code to a codebase the following things need to happen:

  • merge new code with old code in a test environment
  • run a series of tests (many unit tests,less integration tests,few e2e tests)
  • if none of the tests is throwing an error then code is accepted to repo (or at least this is the point that perhaps code review can take place)

This is very useful for automatically checking your own work. Its immeasurably useful when 2 or more people collaborate on a codebase.


Coverage (useful metric but not end-all-be-all)


Coverage is a useful metric for code quality. It measures how much of your code has been used / executed through your tests. By having as high coverage as possible you ensure that at least that part has been tested a bit! It is useful as a metric. But perhaps not as something to blindly adhere to. Ie you could have high coverage but the tests can be simple and not cover all eventualities. One thing for sure is that it points out what has been NOT tested so in that sense its more useful :) I have a post here that talks about coverage in a practial way

Updated: