Integrating Dataiku and PyCharm for Python Development

In this post, we’ll walk through the configuration and setup of PyCharm and a Dataiku DSS Design node as well as code modification and the execution of a debugging session.

Intro

In the ever-expanding Data Science universe, the need for powerful, collaborative team tools has never been greater. Over the past 5 years, Dataiku has emerged as a market leader in Data Science orchestration and is (in our opinion) the absolute leader in bringing technical and non-technical roles together in collaborative projects. For technical users, Dataiku provides the power of Jupyter notebooks out of the box for Python code development, but for those requiring more flexibility, it’s also possible to integrate directly with the PyCharm IDE and debug on your desktop.

Note: Dataiku integration with PyCharm will not be possible if you’re using the Free (Community) edition of Dataiku, which does not allow for API integration.

DSS Project

The Dataiku project that I’ll be demonstrating has a single Python recipe that we’d like to connect to with PyCharm and debug interactively. Note, however, that this integration also lets us edit DSS Plugins, SQL Recipes and R Recipes with PyCharm.

[Image: DSS Python Recipe]

Install the Extension

Assuming you already have PyCharm installed (the Community Edition will work just fine for this tutorial), the first step in integrating Dataiku is to install the Dataiku plugin from the marketplace. Simply search for “dataiku” and then click “Install” on the Dataiku DSS plugin result.

[Image: Dataiku PyCharm Plugin]

Once you have the extension installed, you’ll want to configure it to integrate with your Dataiku Design Node instance. To do this, open the preferences panel as shown in the following screenshot and navigate to the “Dataiku DSS Settings” tab. From this screen, click the “+” to add a new DSS Instance.

[Image: Dataiku DSS Settings in the PyCharm preferences panel]

The “Base URL” is simply the HTTP(S) URL of the Dataiku Design node you’d like to connect to. For those unfamiliar, the API key is the “Secret” value found under the “API keys” tab in your Dataiku user profile.

[Image: Dataiku personal API keys]
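
If the plugin refuses to connect, it can help to try the same Base URL and API key outside of PyCharm. The following is a minimal sketch, assuming the dataiku-api-client package is installed; the host name, port and key shown in the comment are placeholders, and the helper names are made up for this example.

```python
from urllib.parse import urlparse


def normalize_base_url(url):
    """Validate a DSS base URL and strip any trailing slash."""
    cleaned = url.strip()
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("Base URL must look like http(s)://host[:port]")
    return cleaned.rstrip("/")


def list_visible_projects(base_url, api_key):
    """Return the project keys that this API key is allowed to see."""
    # Imported lazily so the URL helper above works without the client installed.
    import dataikuapi

    client = dataikuapi.DSSClient(normalize_base_url(base_url), api_key)
    return client.list_project_keys()


# Example with a hypothetical host and a placeholder secret:
# print(list_visible_projects("https://dss.example.com:11200", "YOUR_API_KEY_SECRET"))
```

If the call returns a list of project keys, the same two values should work in the plugin settings.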

It’s all about the Environment

If you’ve done much Python development, you’re likely familiar with virtual coding environments. In Dataiku, we’re able to create custom Code Environments for our plugins and scripts using the Code environment tab in the Administrative panel. When we integrate with a DSS Design Node instance from PyCharm, we want to create (at least) one PyCharm project for each DSS Code Environment in order to run our code locally.

With the DSS plugin installed, we’ll create a new project in PyCharm that will align with one of the Code Environments in your DSS instance. For example, in my instance, I have a Python 3.6 environment with no custom PIP packages installed. To re-create this environment locally, I’m going to start a new PyCharm project and select the “New Virtual Environment using Virtualenv” option, using the same Python version that I’m using in DSS. If you are using Python 2.7 in your DSS environment, for example, you’ll want to select a path to a local Python 2.7 install.
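
Once the project exists, a quick way to confirm that the local interpreter lines up with the DSS Code Environment is to compare version numbers. This is only a sketch: the (3, 6) value mirrors the example environment above, and the function name is invented for illustration.

```python
import sys

# Assumption: the DSS code environment uses Python 3.6, as in the example above.
DSS_PYTHON_VERSION = (3, 6)


def matches_dss_python(expected=DSS_PYTHON_VERSION, actual=None):
    """Return True when the major.minor version matches the DSS code environment."""
    if actual is None:
        actual = sys.version_info
    return tuple(actual[:2]) == tuple(expected)


if not matches_dss_python():
    print("Local interpreter is Python {}.{}, but the DSS code environment uses {}.{}".format(
        sys.version_info[0], sys.version_info[1], *DSS_PYTHON_VERSION))
```

Running this from the PyCharm project prints a warning only when the two versions differ.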

Create a PIP requirements file

Now that we’ve created a PyCharm project with a new Virtual Environment, the next step is to create a requirements file which will define the PIP packages that are required to debug our files.

In your project, create a new text file with the File=>New=>File menu and name it requirements.txt. In this file, we’ll list a couple of essential packages needed for this integration, along with all of the PIP packages used in the DSS code environment that your remote script uses.

Listed below are the packages that are essential to run any local environment. Again, you’ll also want to add any additional PIP packages used by your Python recipe. Add these lines to your requirements.txt file.

dataiku-api-client
pandas
numpy==1.19.3  # for Windows, MUST be 1.19.3 until bug fix in 1.19
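
To check that nothing from the DSS code environment was left out of the local file, the two package lists can be compared with a few lines of Python. This is a rough sketch that only understands plain names and `==` pins; the function names and the sample DSS package list (including scikit-learn) are assumptions for illustration.

```python
def parse_requirements(text):
    """Return {package_name: pinned_version_or_None} from requirements.txt content."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def missing_locally(local_text, dss_text):
    """List packages named in the DSS environment but absent from the local file."""
    local = parse_requirements(local_text)
    return sorted(name for name in parse_requirements(dss_text) if name not in local)


local_requirements = """\
dataiku-api-client
pandas
numpy==1.19.3  # for Windows
"""
# Hypothetical package list copied from the DSS code environment page.
dss_requirements = "pandas\nscikit-learn==0.24.2\n"
print(missing_locally(local_requirements, dss_requirements))
```

Any name it prints is a candidate for another line in requirements.txt.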

With this file created, we’re now ready to open a file from our remote DSS instance in PyCharm.

Opening a Python Recipe from DSS

With the DSS Plugin installed, pulling Python files from your Dataiku flow is integrated right into the PyCharm menu. Navigate to File=>Open Dataiku DSS, and if your plugin has been configured correctly, you should see windows similar to the screenshots below which allow you to navigate through your DSS project recipes and select a Python script.

[Image: Opening a Recipe from DSS instance]

When you’re selecting a file, be sure to leave the “Generate Runtime Configuration(s)” checkbox selected. Also click the “Install” button shown in the screenshot, which installs the necessary Dataiku API PIP packages into your local Virtual Environment.

[Image: Installing the Dataiku Client library]

With that complete, click “Finish” and your remote Python file should be opened in PyCharm!

Debugging in PyCharm

Now that you have your Virtual Environment set up locally and the remote file open, you’ll notice that PyCharm is prompting you to install the missing requirements that we’ve defined in the requirements.txt file. Go ahead and click “Install requirement” to add these packages to your local environment.

[Image: Installing packages from requirements.txt file]

Before we move to the next step, I’ll point out a key piece of the puzzle in the PyCharm integration. If you navigate to the Run=>Edit Configurations menu, you’ll see the debugging configuration that has been generated by the Dataiku plugin. In this configuration, the Environment variables field has been populated with DKU_CURRENT_PROJECT_KEY set to the key of your selected Dataiku project. This is a very nice feature, making things a bit easier than the Visual Studio Code integration, but be aware that this debug configuration is specific to this DSS project.

[Image: Debugging Configuration project key]
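
Because the configuration is tied to one project, a script started without it will not have the variable set. A small guard like the following sketch makes that failure obvious; the function name is hypothetical, and only the DKU_CURRENT_PROJECT_KEY variable comes from the generated configuration.

```python
import os


def current_project_key(environ=None):
    """Return the DSS project key from the environment, or fail with a clear message."""
    if environ is None:
        environ = os.environ
    key = environ.get("DKU_CURRENT_PROJECT_KEY")
    if not key:
        raise RuntimeError(
            "DKU_CURRENT_PROJECT_KEY is not set; run the script with the "
            "configuration generated by the Dataiku plugin."
        )
    return key
```

Calling current_project_key() at the top of a local debugging session confirms which project the run is pointed at.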

Start Debugging

With the Python file open, add a breakpoint somewhere in your code and click the Run => Debug menu item. This starts a fully interactive debugging session which, if everything is configured correctly, has full access to your DSS datasets and lets you pull data from your DSS instance!
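
For readers without a recipe at hand, here is a small sketch of the kind of code that is convenient to step through. It assumes the Dataiku client library installed earlier makes `import dataiku` available locally; the dataset name "customers" and the column "country" are hypothetical.

```python
def count_by_field(rows, field):
    """Count how many rows carry each value of `field` (rows are plain dicts)."""
    counts = {}
    for row in rows:
        value = row.get(field)
        counts[value] = counts.get(value, 0) + 1  # a handy line for a breakpoint
    return counts


def main():
    # Imported lazily: needs the Dataiku client library installed by the plugin.
    import dataiku

    # "customers" and "country" are hypothetical dataset and column names.
    df = dataiku.Dataset("customers").get_dataframe()
    print(count_by_field(df.to_dict("records"), "country"))


# Call main() from the generated debug configuration to hit the breakpoint.
```

Keeping the row logic in its own function means it can be stepped through, or tested, without touching the DSS instance at all.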

File Modifications

Debugging is only one of the exciting features available with this integration. We can also make local edits to the Python file and seamlessly save them back to our Dataiku instance.

To configure this integration, let’s refer back to the DSS Settings panel in PyCharm. In this panel, you’ll see an “Automatic synchronization” option, which determines whether the changes you make to files in PyCharm are sent immediately to the DSS server or require manual synchronization. If you prefer to send your file changes to the Design node manually, uncheck the “Automatic synchronization” box.

[Image: Dataiku automatic synchronization]

Once you’ve verified the synchronization setting, you should be able to modify your local version of the remote Python file. If you’ve chosen to synchronize manually, you can send your modifications to the server by selecting the File=>Synchronize with DSS menu option.

[Image: File=>Synchronize with DSS]

Recap

In this post, we’ve covered the configuration and setup of PyCharm and a Dataiku DSS Design node for debugging and editing Python code, extending the powerful capabilities of DSS to the desktop. For an alternative IDE debugging experience, check out this article on setting up Visual Studio Code integration with Dataiku.