🧹Course: Cleaning, preparing and transforming data with Tableau Prep
Up to now we've created solutions where the data has already been prepared for us, in this course we look at how we can clean, prepare and transform data to use in our analysis using Tableau Prep
For information on how Tableau Academy works please refer to our Welcome To Tableau Academy Guide.
This Week’s Project
In most of the projects we’ve covered so far the data used has already been prepared for you, in reality a large part of data analysis and visualisation involves gathering and preparing the data you’ll need to use so this week we look at using Tableau Prep Builder to clean, prepare and transform date for the project.
The project involves creating a data set that can be used to analyse how much time and resource is being used across some key business change projects and will cover combining timesheet data from a number of sources, adding in HR data and then cleaning and transforming the data ready for analysis.
The Brief:
In your role as a Project Management Analyst you’ve been asked to start providing insight into the allocation of people’s time to five key business change projects that are currently in progress at your company.
You have access to download data from your companies timesheet booking systems where every employee has to book how many hours they’ve worked on each specific project and activities within that project.
They also have to book any none project related time such as training, annual leave or other types of admin not, to enable this to be done a ‘dummy’ project has been created in the timesheet system to book this time against.
Unfortunately each region in your company uses a different timesheet booking system, while they are all from the same supplier there is no connectivity or interaction between the individual regions systems so you’ll need to combine multiple data sources.
The Business Change leadership team have also asked if it’s possible to get data that includes the following as this will form part of the ongoing analysis and reporting of this data:
To be able to see how much time has been booked to each individual project and activity each month.
To be able to cut this by countries, regions, job roles and individual people.
To get an estimate of the relative costs of the time, using the average Full Time Employee (FTE) costs per job role in each region as supplied in the HR file.
To be able to attribute the time/costs booked against the none project related ‘dummy’ project to the projects each person works on.
To not include anyone from the communications team that performs a comms role as these are third party resources and their time/cost is already accounted for in the contract with the third party.
For the data to be updated each month to add the latest complete month to the data, or to be updated adhoc if required with the latest data available within a month.
For the second two points you’ve been supplied with an Employee Data file from HR which contains a list of relevant employees and the required details as well as a list of the average FTE costs for each role in each region.
For point 4 it’s not possible to get the data in this way from the timesheets data or to update the timesheet systems to provide it so you’re going to need to find a way to transform the data to provide the required info.
The last point is critical to this project, as a one off this could be done manually but with the requirement to update it every month or on demand an easily repeatable solution is required.
Required Data:
The files you need for this can be found in the Course Data Files folder for this project, as this course is focussed around preparing data we’ll explore the data and files as we go through so for now you just need to make sure you’ve downloaded the files in the above folder.
Additional Information:
This project uses Tableau Prep Builder, this is included as part of your Tableau Desktop or Creator license. If you don’t currently have a license or are using Tableau Public you can get a free 14 day trial of Tableau Prep to complete this course (make sure you don’t activate it until you’re ready to start though!)
More info on Tableau Prep Builder can be found here: https://www.tableau.com/en-gb/products/prep
And the latest version can be downloaded here: https://www.tableau.com/en-gb/products/prep/download
Tableau Prep offers an easy to use, drag and drop way to clean, prepare and transform data ready for analysis and be able to repeat these steps, in organisations you may find that this can be done directly within your databases/data warehouse using SQL however it’s still useful to know what can be done with Tableau Prep since it’s part of your Tableau license anyway.
Tableau Prep also allows you to take care of preparing data yourself if you don’t have the access or expertise to write SQL code to do this, it’s also useful for quickly preparing data for one off exercises, proof of concepts or where you can’t make changes to the underlying data sources like in the scenario we’re looking at.
Course Files:
You can download the files required for this project from here: Tableau Academy Course Files
In particular you will need to download and save the source data files from the the Course Data Files folder
We’ve also included the data files that are output as part of this course and the Tableau Prep ‘flow’ files used to create them for your reference.
Solution
The steps for the solution are shown below, a short video follows each section showing that part of the solution being done. Additional info is shown in highlighted paragraphs like this one.
The first time you come across a new requirement (e.g. to create a calculated field) detailed instructions are provided, in subsequent steps these don’t tend to be repeated unless required. We highly recommend watching the accompanying video with each section to see exactly how each step works and what you need to do.
The steps below detail the solution created for this and how it was built for you to follow through.
Step 1: Combine the timesheet data exports from each region into one file
Step 1.1 - Check the timesheets data files
To begin with we want to have a look at the source files we’ve got to see what we need to do with them.
If we open the three “Timesheets_ …” files we notice:
There’s a separate file for each region, we’ll want to join these together to create a single table with the timesheet data we’ll use.
Within the Timesheets_Americas.csv file the field that contains the booked hours is entered as text in the format of ‘15 hours’, we’ll need to remove the text and convert to a number.
The Timesheets_APAC.xlsx file has a sheet per country which we’ll need to combine together.
The Details field in each file contains a combination of the month, project and activity which we’ll need to split out to be able to use that data.
Step 1.1 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.2 - Combine the APAC files into one
The first thing we’re going to do is combine our APAC file which is split by country into one table (APAC is short for Asia Pacific and refers to countries in this region), this will mean we can apply our cleaning and preparing steps just once on the full table rather than having to repeat the same step for multiple files.
Open Tableau Prep Builder and click connect to data > To a File > Microsoft Excel.
Locate the Timesheets_APAC.xlsx file that you should have already downloaded and click Open.
On the blue bar on the left you should now see a table for each of the Asia Pacific countries.
Click the first table named Hong Kong and drag it onto the Add Data pane.
Once you drop the table onto the pane a grey Input step window should appear in the bottom half of the screen, this is where we set how to load the data in from our file.
On the left hand side of the grey window the tab for Multiple Files should have opened, if not click on this tab.
Select Wildcard union from the two options available.
The wildcard union is where you tell Tableau Prep how to find files with a similar name etc. so it can load them without you having to specify each individual file one by one. If you have a look at some of the options you’ll see this is pretty flexible and can be used to locate files in a particular folder, that have a similar naming or follow a particular naming convention.
Note: We could use this to load the additional files we want but we want to load them in separate steps so for now we’re just going to use this to load in all of the individual country sheets from the APAC file.
Tableau Prep should have automatically found the five sheets within the file we’re loading, you don’t need to change any options but you should see that it’s going to pull in all sheets from the file we’ve specified, at the bottom you should see five sheets each with a different country name under the Include files / Include sheets list.
Click the blue Apply button
Rename the step you just created by right clicking the blue input step that appeared on the pane and clicking Rename.
Name it as ‘Load APAC File’, you can add a description as well if you want.
Once we’ve created an input step we can then use the Tableau Prep functions to do a lot of cleaning, preparing and transforming our data. For this step we’re going to look at some basic cleaning options that removes data we don’t want in our end file.
Click the blue + sign next to the step you just created .
Select Clean Step from the menu. that appears.
A new step should now have appeared in your data pane called Clean 1, the pane below should display the columns and data in your table, to begin with we’re just going to remove a couple of columns from this table.
Scroll across to the Tableau Names field and click it
Some options icons should now appear in the column header, click the 3 dots (…) to bring up the columns menu
Click the Remove option.
Repeat the same with the File Paths column to remove that too.
This has now removed those columns that we don’t need from our data without changing the underlying data source files (if you were to open them up you should see they haven’t changed).
Tableau Prep allows you to see each of the clean steps you’ve created and undo or edit them if you need to as well, this is a great way of being able to track what changes have been made.
Click the Changes side bar at the left to expand it.
You should see the 2 remove steps you just created in here, if you decided you wanted to keep these columns at a later date you can simply remove the steps by clicking them and then clicking the x on the right hand side (later on we’ll see how we can edit certain steps as well but with a remove step there’s nothing we can edit).
Rename the Clean 1 step to Clean APAC Input and we’re done with loading the APAC timesheet data and can move on to the next file.
Step 1.2 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.3 - Load and clean the Americas timesheets data
In this step we’re going to add in the Americas timesheet data and clean the hours booked data field so that it gives us the time as a number that we can use later on.
If the blue Connections bar isn’t visible on the left click to expand it.
Click the + icon next to Connections at the top of the blue side panel.
Under To a File click Text file and locate the Timesheets_AMERICAS.csv file that you saved earlier.
This time we don’t need to do anything with the input step so just rename the new input step that’s appeared as Load Americas Data and add a new Clean step.
Notice that the Hours column is being brought in as a text string, this is because when the data has been entered it’s been entered as x hours. This means we can’t do any calculations with the numbers so let’s clean this up.
Click the 3 dots at the top of the Hours column and hover over Clean >.
You should see that there’s a few options for being able to quickly clean our data, they’re all pretty self explanatory and we’ll use some of the others later in the course.
For this step click Remove Letters.
Notice how Tableau Prep has gone through each entry and removed any letters for us?
We also had a space between the number and hours (you might not be able to see this immediately but if you double click one of the values in the column you should see the space at the end), we’ll need to remove that before we can use the numbers,
Click the column menu and Clean > again
This time click Remove All Spaces.
We can now convert the columns format to numbers so that it’s the same as the fields from the other data sources.
Click the Abc icon at the top of the column
Select Number (decimal) from the menu that pops up.
That column should now have updated to a number format.
If you click the Changes tab again you can see the steps we’ve just created. Notice that the first three have a pencil icon next to them, this means you can come back and edit those steps in the future if you need to without having to remove and re-create them.
That’s all we need to do with the Americas file so rename the Clean 2 step as Clean Americas Data and we can move onto importing our final timesheet file.
Step 1.3 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.4 - Load the EMEA timesheets data
In this step we’ll follow the same process for loading our final timesheets file in from the EMEA region (EMEA stands for Europe, Middle East and Africa and covers countries in these regions), in this step we don’t need to do any cleaning though so this will be a simple loading of the data.
Click the + sign in the blue Connections side bar
Click Text file under To a File again and this time locate the Timesheets_EMEA.csv file.
Rename the step as Load EMEA Data
Add a new Clean step and rename it as Clean EMEA Data.
We don’t actually need to do any cleaning in this step but it’s good practice to add a clean step after an input step as this makes it a lot easier if you do need to do any cleaning of the file in the future or replace the input file.
Step 1.4 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.5 - Combine the three regions timesheets data into a single table
We’ve now loaded the timesheets data for each region and done any cleaning that is relevant for each specific file, in this step we’re going to combine them together into one table that we can then continue to clean and prepare.
Note: Combining the files into one table means you can just set up and do the required cleaning once rather than having to set it up for each individual file, this makes sense when the same steps need applying to data from multiple sources/files.
Click the Clean APAC Input step and drag it over the top of the Clean Americas Input step (Note: Minimise the grey panel to make it easier to work with the steps in the pane, you can expand it again when you need to do some work on an individual step).
A couple of options should appear as you drag the Clean APAC Input step over the Clean Americas Input step, keep dragging until the Union option is highlighted and then drop the Clean APAC Input here and a new Union 1 step should appear on your pane.
The union step will try to combine two data tables where they have the same fields, we’ll look at the options for this in a minute but first we want to add the EMEA data to it.
Click the Clean EMEA Data step and drag this on top of the Union 1 step
This time drop it on the Add option when it appears.
The steps should now show a line from each of the Clean.. steps to the new Union step.
We use the Add option in the second step as we’re adding to the union step we already created, if we dropped it onto the Union option it’d created another new step which we don’t want here.
Click the Union 1 step and make sure the grey info panel at the bottom of the screen is expanded.
With a union step Tableau Prep will create a combined table from the specified data tables, when the individual tables have the same column names and number format Tableau Prep will automatically combine them.
On the Settings tab that should be visible you should be able to see that Tableau Prep has been able to identify that the Employee, Details and Hours columns are the same in the Americas and EMEA data so has combined those columns.
It hasn’t been able to determine the relevant columns from the APAC file as they have different names so we need to match them up manually
Note: An alternative option would be to rename the columns in the clean step so they all have the same naming, this is what we’d normally do but we’ve left it this way so we can show you how to match columns manually here.
Under the Mismatched Fields first click Employee (The first square should be empty and the last two showing the colours assigned to the Americas/EMEA files).
Hover over Name and then click the + icon that should appear (The Employee field needs to be highlighted with a blue outline for this to appear).
This should now disappear as the fields have been matched across all three files.
Do the same with Details > Info and Time Booked (Hrs) > Hours to match the remaining fields.
We’ve now got a table with the timesheet data from all of our input files, rename the Union 1 step to Combine Timesheet Data.
Step 1.5 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.6 - Finish cleaning and preparing the timesheet data
Our final step in this section is to finish off cleaning and preparing the timesheet data, as these steps apply to all of our input files we can do them on the combined data, if we hadn’t combined out data we’d have to repeat this on each of our files/sheets.