Data scientist is the sexiest job of the 21st century. Python is the default choice for many data scientists because it is well suited to machine learning, artificial intelligence and complex data operations. Since you are reading this article, I assume you know at least a little about Jupyter Notebook, Kaggle, scikit-learn, pandas, Keras, TensorFlow, PyTorch, etc. Is data science a collection of buzzwords? I would say “kind of” for personal interest or simple research projects. However, if you work as a data scientist, a solid understanding of and formal training in computer science or software development are mandatory for effective collaboration and production work. Otherwise, a data scientist will be lost in a conversation with a Python developer. That conversation will probably go something like this:

Python developer: "You didn't check your code and your tests into master without a code review, did you?"

Data scientist: "Checked my what into what without a what?"

In data-driven product development, some skills are extremely important for a data scientist, such as writing modular and reusable code, commenting, version control, testing, and logging. This article briefly introduces automated unit testing in Python using 5 frameworks: pytest, nose, Robot Framework, zope.testing, and Jasmine-py. They are widely used in different domains by data scientists and Python developers.

Unit testing means invoking your code to check that it has certain properties. You should design each unit test around the expected behavior of your code, so that the test passes when that behavior is produced and fails when it is not. The unit could be an entire module, a single class or function, or even a variable. The tested unit should be isolated from other code that is not under test. Python developers care about testing things such as database connections, interactive behavior in web applications, etc. Data scientists pay more attention to the manipulation of data. Some basic unit tests for data scientists include, but are not limited to:

  • Data type: what if a column that used to be an integer becomes a floating value?
  • Data structure: what if the data source changes the column order or name?
  • Data range: what if the age of a person is negative?
  • Data pattern: what if a numerical feature is not normally distributed but is used in a linear regression model?
  • Logical errors: wrong indexing, etc.
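
Several of the checks in this list can be written as ordinary test functions. The following is only a minimal sketch: it uses plain Python dictionaries as a stand-in for rows of a table, and the names EXPECTED_COLUMNS and validate_rows are hypothetical helpers invented for this illustration, not part of any library.

```python
# Hypothetical schema: column names in the order the pipeline expects them.
EXPECTED_COLUMNS = ["name", "age"]


def validate_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows):
        # Data structure: column names and their order must match the schema.
        if list(row.keys()) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {list(row.keys())}")
            continue
        age = row["age"]
        # Data type: age must be an int (bool is a subclass of int, so exclude it).
        if not isinstance(age, int) or isinstance(age, bool):
            problems.append(f"row {i}: age has type {type(age).__name__}, expected int")
        # Data range: a negative age is never valid.
        elif age < 0:
            problems.append(f"row {i}: age {age} is negative")
    return problems


def test_clean_rows_pass():
    assert validate_rows([{"name": "Ann", "age": 31}]) == []


def test_float_age_is_reported():
    problems = validate_rows([{"name": "Ann", "age": 31.0}])
    assert problems == ["row 0: age has type float, expected int"]


def test_negative_age_is_reported():
    problems = validate_rows([{"name": "Bob", "age": -4}])
    assert problems == ["row 0: age -4 is negative"]


def test_renamed_column_is_reported():
    problems = validate_rows([{"name": "Cy", "Age": 40}])
    assert len(problems) == 1 and "unexpected columns" in problems[0]
```

The same ideas carry over to a pandas DataFrame, where the checks would look at column names, dtypes and value ranges instead of dictionary keys and values.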

If you are still confused about what unit testing is, I strongly recommend that you read this simple unit test example and then continue. You should at least know the Python standard library module unittest and some of its assert methods.
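
As a quick refresher, here is a small unittest sketch that exercises three common assert methods. The function mean is a hypothetical helper written only for this illustration.

```python
import io
import unittest


def mean(values):
    """Hypothetical helper: arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class MeanTests(unittest.TestCase):
    def test_integers(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_floats(self):
        # assertAlmostEqual avoids brittle exact comparisons of floats.
        self.assertAlmostEqual(mean([0.1, 0.2]), 0.15)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])


def run_mean_tests():
    """Run the test case programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTests)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

In everyday use you would normally skip run_mean_tests() and let the command python -m unittest discover and run the test case.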

The most popular testing framework is pytest, which has more than 3667 stars on GitHub. It can output detailed information on failing assert statements, automatically discover test modules not only in the current directory but also across the whole Python package, parametrize tests and use built-in fixtures to provide a fixed testing baseline across functions/classes/modules, and run unittest and nose test suites. It also has more than 300 plugins for all kinds of applications. The function inc(x) and pytest function test_answer() are in

Simply type pytest, and it will automatically discover the test files and print a detailed report of the failed assertion.
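
The original file is not reproduced here. As a minimal sketch, assuming that inc(x) simply adds one to its argument (the file name test_sample.py is also an assumption), the two functions might look like this:

```python
# test_sample.py (hypothetical file name)


def inc(x):
    # Assumed behavior: add one to the input.
    return x + 1


def test_answer():
    # pytest collects functions named test_* and reports any failing assert.
    assert inc(3) == 4
```

Changing the expected value to 5 would make the assertion fail, and pytest would then report the values on both sides of the comparison.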

The nose framework has more than 1250 stars on GitHub. Although it has not been updated for more than 3 years, it is still a popular framework. Compared to pytest, it does not have any special advantages, but it is slightly faster. However, pytest is parallelizable (threading + SMP support) and can be much faster in parallel testing mode. nose can automatically find the test modules in the current directory, and you can also define the discovery rules via regex using the setup or configuration file. The nose output for the same file is shown below. The default output includes the details of the failed assertion. I prefer the highlighted, colorful and concise output of pytest. Unlike pytest, which has good compatibility, nose cannot run tests written in pytest style.

Robot Framework has more than 3000 stars on GitHub. It is commonly used with the Selenium WebDriver library to test web applications. It also has many libraries for testing FTP, MongoDB, Android, and more, as well as many APIs that make it as extensible as possible. It uses data-driven, keyword-driven and behavior-driven approaches to make tests readable and easy to create. The data-driven style can be used to test the same workflow with varying input data. Test cases that describe some kind of workflow may be written in either keyword-driven or behavior-driven style. Robot Framework uses *.robot files to configure the tests. In the example Python calculator application, several functions are tested using a robot file in the keyword-driven style. The value 3 at line 34 is changed to 3.5 to check the error information, and the test result is shown below.

Each tested function has its corresponding result, with a concise explanation for any failed test. If you want to know more about a failure, the details are in report.html, as shown below. The log and report can be streamed to Sentry for better teamwork.
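
The robot file itself is not reproduced here. On the Python side, a Robot Framework keyword library can be a plain class whose public methods become keywords, so a method named add_numbers can be called from a robot file as Add Numbers. The sketch below is only an illustration: the class and method names are hypothetical and are not taken from the calculator example discussed above.

```python
class CalculatorLibrary:
    """Hypothetical keyword library for a keyword-driven calculator test."""

    def __init__(self):
        self._result = None

    def add_numbers(self, a, b):
        # Arguments written in a robot file typically arrive as strings,
        # so this sketch converts them before doing arithmetic.
        self._result = float(a) + float(b)

    def result_should_be(self, expected):
        # Raising an exception is how a keyword signals failure.
        if self._result != float(expected):
            raise AssertionError(f"{self._result} != {float(expected)}")
```

The class can also be exercised directly from Python: after add_numbers("1", "2"), the call result_should_be("3") passes silently, while result_should_be("3.5") raises an AssertionError, which is the kind of failure the report would then describe.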

The zope.testing package only supports traditional Python test styles like unittest and doctest, not the radically simpler styles permitted by the more recent frameworks. But it does offer a powerful system of layers, with which whole directories full of tests can rely on common setup code that creates the environment the tests need to run in once for the layer (rather than once for each test). It is not user friendly for data scientists and is mostly used by professional developers of Zope projects.
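
The idea of setting something up once for a group of tests, rather than once per test, can be illustrated with the standard library alone. Note that this sketch uses unittest's setUpClass hook as a stand-in; it is not the zope.testing layer API, and the class name and dataset are hypothetical.

```python
import io
import unittest


class SharedDatasetTests(unittest.TestCase):
    setup_calls = 0  # counts how often the shared setup actually runs

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, not once per test method.
        cls.setup_calls += 1
        cls.dataset = list(range(1000))  # stand-in for an expensive resource

    def test_length(self):
        self.assertEqual(len(self.dataset), 1000)

    def test_first_value(self):
        self.assertEqual(self.dataset[0], 0)


def run_shared_dataset_tests():
    """Run both tests and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedDatasetTests)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Both tests read the same dataset, yet the setup counter increases only once per run. Layers apply the same principle at a larger scale, across many test classes and directories.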

Jasmine is a behavior-driven development testing framework for JavaScript. It does not rely on browsers, DOM, or any JavaScript framework. Thus, it's suited for websites, Node.js projects, or anywhere that JavaScript can run. Jasmine for Python uses the same configuration file and the same structure as Jasmine in other languages. Just install the package and instantly begin testing the JavaScript in your Python project. You can also run Jasmine tests from pytest with Selenium and report the results. This framework is helpful for Python developers, not for data scientists.

Besides these 5 testing frameworks, you may also find many more tools like tox, mock, hypothesis, etc. There is no one solution or approach that fits all. As a data scientist, I would recommend that you start with pytest. After you learn more about unit testing and need to test more complex applications, you can consider switching to other frameworks to make use of their advantages. For Python developers, it is more important to coordinate many people working on the same code. You may need to sync your GitHub projects and test your code automatically via a continuous integration service like Travis CI, Jenkins, Buildbot, etc.
