Adding new data sources to an analysis is a common activity. Every new data source, no matter how curated, needs additional data prep. In this post, we will demonstrate our standard data preparation steps for new data sources in Dataiku.
One of the many benefits of data science is breaking down data silos. This is achieved by bringing data from different sources into the same platform for analysis. Dataiku makes this extremely easy, with connectors to most data platforms and file types (full list here). Helpful note: I prefer to bring in any new data source in its rawest form.
As a Citizen Data Scientist, I get the first peek at the new data set. It's tempting to start joining it with other data right away. However, prior to joining data sets, it is best practice to do some standard data preparation steps first.
These steps will save you a significant amount of time and headaches in the future. In this post, I will be demonstrating these steps in Dataiku. However, these steps can be followed no matter what tool you are using!
1. Create a Prepare Recipe in Dataiku
2. Clean the Empty Columns
When you see an empty column, always check with business users whether it was intentionally left blank or whether the data wasn't properly brought into Dataiku on upload.
A quick way to identify empty columns is in the Explore tab. Go to the list view and look for columns without any valid data. Then confirm that the column is empty in the full data set, not just in your sample.
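If you are following along outside Dataiku, the same check can be sketched in pandas. This is only an illustration, not what Dataiku does internally; the helper `find_empty_columns` and the sample column names are made up for the example.

```python
import pandas as pd


def find_empty_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that hold no valid data at all."""
    empty = []
    for col in df.columns:
        series = df[col]
        # A cell counts as empty if it is missing or only whitespace
        blank = series.isna() | series.astype(str).str.strip().eq("")
        if blank.all():
            empty.append(col)
    return empty


# Hypothetical sample data
sample = pd.DataFrame({
    "station": ["A", "B", "C"],
    "notes": [None, None, None],
    "comment": ["", "  ", None],
})

empty_cols = find_empty_columns(sample)
print(empty_cols)
```

Run this on the full data set rather than a sample before dropping anything, and still check with business users why the column is empty.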
3. Rename Columns
Renaming columns is especially important when:
- Columns have duplicate names but different data content
- A name is not descriptive enough, such as “Date” where “Create Date” would be clearer
- A name has special characters or spaces and you want to use the column in formulas later, such as using “Total_Assets” instead of “Total Assets ($)”
Rename a column by clicking on the column name and selecting Rename.
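For readers using another tool, here is a rough pandas sketch of the same renaming ideas. The helper `clean_column_name` and the sample columns are hypothetical, and the cleaning rule (drop special characters, turn spaces into underscores) is just one reasonable convention.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Make a column name formula-friendly."""
    # Drop anything that is not a letter, digit, space, or underscore
    cleaned = re.sub(r"[^0-9A-Za-z_ ]+", "", name)
    # Trim the ends, then turn runs of whitespace into one underscore
    return re.sub(r"\s+", "_", cleaned.strip())


# Hypothetical sample data
df = pd.DataFrame({
    "Total Assets ($)": [100.0, 250.5],
    "Date": ["2024-01-05", "2024-02-10"],
})

# First clean every name, then give vague names a descriptive one
renamed = (
    df.rename(columns=clean_column_name)
      .rename(columns={"Date": "Create_Date"})
)
print(list(renamed.columns))
```

The explicit mapping in the second `rename` is where you would also separate columns that share a name but hold different content.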
4. Parse Dates
Parsing dates makes charting and date manipulation easier, such as calculating the time between two dates.
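As a tool-agnostic illustration, the pandas sketch below parses two text columns into dates and then computes the time between them. The column names and the `%Y-%m-%d` format are assumptions for the example; your data may need a different format string.

```python
import pandas as pd

# Hypothetical sample data with dates stored as text
orders = pd.DataFrame({
    "Create_Date": ["2024-01-05", "2024-02-10", "not a date"],
    "Close_Date": ["2024-01-15", "2024-03-01", "2024-03-05"],
})

for col in ["Create_Date", "Close_Date"]:
    # errors="coerce" turns unparseable values into NaT instead of failing
    orders[col] = pd.to_datetime(orders[col], format="%Y-%m-%d", errors="coerce")

# Once parsed, date arithmetic is a simple subtraction
orders["Days_Open"] = (orders["Close_Date"] - orders["Create_Date"]).dt.days
print(orders)
```

Rows where a date could not be parsed end up with a missing `Days_Open`, which makes them easy to find and review.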
5. Review Data Types and Meanings
Each column has a storage type (how the data is stored) and a meaning (what the data represents). The data quality bar shows which rows in the sample are valid for the column's meaning. It also shows the percentage of the sample without any data. For invalid or empty data, the right fix depends on the situation: sometimes you will remove the entire row, and other times you will leave the field blank or replace it with the accurate data.
In this example, PrecipTotal is being stored as a string, but the meaning (inferred from the data) is decimal. Assuming the decimal meaning is correct, we see that some data does not match this meaning (“T”). This is an opportunity to update the storage type if the data is valid, or to remove the data that is inconsistent with the meaning.
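Outside Dataiku, a similar valid / invalid / empty breakdown can be approximated in pandas. This is a sketch under assumptions: the sample values are invented, and only the PrecipTotal column name and the “T” value come from the example above.

```python
import pandas as pd

# Hypothetical sample: PrecipTotal stored as text
weather = pd.DataFrame({"PrecipTotal": ["0.00", "0.12", "T", None, "1.05"]})

# Try to read every value as a decimal; failures become NaN
as_decimal = pd.to_numeric(weather["PrecipTotal"], errors="coerce")

empty = weather["PrecipTotal"].isna()
# Invalid = a value is present but does not parse as a decimal
invalid = ~empty & as_decimal.isna()
valid = ~empty & ~invalid

counts = {
    "valid": int(valid.sum()),
    "invalid": int(invalid.sum()),
    "empty": int(empty.sum()),
}
invalid_values = weather.loc[invalid, "PrecipTotal"].unique().tolist()
print(counts, invalid_values)

# Option 1: keep every row and leave invalid fields blank
blanked = weather.assign(PrecipTotal=as_decimal)
# Option 2: keep only the rows with a valid decimal
valid_only = blanked[valid]
```

Which option is right depends on what the invalid value means, so it is worth asking business users before removing or blanking anything.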
In this post, we’ve covered the basic prep recipe steps you should do when bringing in any new data set. These steps will set you up for success in preparing your data for everything from descriptive statistics to machine learning.