đ§šCourse: Cleaning, preparing and transforming data with Tableau Prep
Up to now we've created solutions where the data has already been prepared for us, in this course we look at how we can clean, prepare and transform data to use in our analysis using Tableau Prep
For information on how Tableau Academy works please refer to our Welcome To Tableau Academy Guide.
This Weekâs Project
In most of the projects weâve covered so far the data used has already been prepared for you, in reality a large part of data analysis and visualisation involves gathering and preparing the data youâll need to use so this week we look at using Tableau Prep Builder to clean, prepare and transform date for the project.
The project involves creating a data set that can be used to analyse how much time and resource is being used across some key business change projects and will cover combining timesheet data from a number of sources, adding in HR data and then cleaning and transforming the data ready for analysis.
The Brief:
In your role as a Project Management Analyst youâve been asked to start providing insight into the allocation of peopleâs time to five key business change projects that are currently in progress at your company.
You have access to download data from your companies timesheet booking systems where every employee has to book how many hours theyâve worked on each specific project and activities within that project.
They also have to book any none project related time such as training, annual leave or other types of admin not, to enable this to be done a âdummyâ project has been created in the timesheet system to book this time against.
Unfortunately each region in your company uses a different timesheet booking system, while they are all from the same supplier there is no connectivity or interaction between the individual regions systems so youâll need to combine multiple data sources.
The Business Change leadership team have also asked if itâs possible to get data that includes the following as this will form part of the ongoing analysis and reporting of this data:
To be able to see how much time has been booked to each individual project and activity each month.
To be able to cut this by countries, regions, job roles and individual people.
To get an estimate of the relative costs of the time, using the average Full Time Employee (FTE) costs per job role in each region as supplied in the HR file.
To be able to attribute the time/costs booked against the none project related âdummyâ project to the projects each person works on.
To not include anyone from the communications team that performs a comms role as these are third party resources and their time/cost is already accounted for in the contract with the third party.
For the data to be updated each month to add the latest complete month to the data, or to be updated adhoc if required with the latest data available within a month.
For the second two points youâve been supplied with an Employee Data file from HR which contains a list of relevant employees and the required details as well as a list of the average FTE costs for each role in each region.
For point 4 itâs not possible to get the data in this way from the timesheets data or to update the timesheet systems to provide it so youâre going to need to find a way to transform the data to provide the required info.
The last point is critical to this project, as a one off this could be done manually but with the requirement to update it every month or on demand an easily repeatable solution is required.
Required Data:
The files you need for this can be found in the Course Data Files folder for this project, as this course is focussed around preparing data weâll explore the data and files as we go through so for now you just need to make sure youâve downloaded the files in the above folder.
Additional Information:
This project uses Tableau Prep Builder, this is included as part of your Tableau Desktop or Creator license. If you donât currently have a license or are using Tableau Public you can get a free 14 day trial of Tableau Prep to complete this course (make sure you donât activate it until youâre ready to start though!)
More info on Tableau Prep Builder can be found here: https://www.tableau.com/en-gb/products/prep
And the latest version can be downloaded here: https://www.tableau.com/en-gb/products/prep/download
Tableau Prep offers an easy to use, drag and drop way to clean, prepare and transform data ready for analysis and be able to repeat these steps, in organisations you may find that this can be done directly within your databases/data warehouse using SQL however itâs still useful to know what can be done with Tableau Prep since itâs part of your Tableau license anyway.
Tableau Prep also allows you to take care of preparing data yourself if you donât have the access or expertise to write SQL code to do this, itâs also useful for quickly preparing data for one off exercises, proof of concepts or where you canât make changes to the underlying data sources like in the scenario weâre looking at.
Course Files:
You can download the files required for this project from here:Â Tableau Academy Course Files
In particular you will need to download and save the source data files from the the Course Data Files folder
Weâve also included the data files that are output as part of this course and the Tableau Prep âflowâ files used to create them for your reference.
Solution
The steps for the solution are shown below, a short video follows each section showing that part of the solution being done. Additional info is shown in highlighted paragraphs like this one.
The first time you come across a new requirement (e.g. to create a calculated field) detailed instructions are provided, in subsequent steps these donât tend to be repeated unless required. We highly recommend watching the accompanying video with each section to see exactly how each step works and what you need to do.
The steps below detail the solution created for this and how it was built for you to follow through.
Step 1: Combine the timesheet data exports from each region into one file
Step 1.1 - Check the timesheets data files
To begin with we want to have a look at the source files weâve got to see what we need to do with them.
If we open the three âTimesheets_ âŚâ files we notice:
Thereâs a separate file for each region, weâll want to join these together to create a single table with the timesheet data weâll use.
Within the Timesheets_Americas.csv file the field that contains the booked hours is entered as text in the format of â15 hoursâ, weâll need to remove the text and convert to a number.
The Timesheets_APAC.xlsx file has a sheet per country which weâll need to combine together.
The Details field in each file contains a combination of the month, project and activity which weâll need to split out to be able to use that data.
Step 1.1 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.2 - Combine the APAC files into one
The first thing weâre going to do is combine our APAC file which is split by country into one table (APAC is short for Asia Pacific and refers to countries in this region), this will mean we can apply our cleaning and preparing steps just once on the full table rather than having to repeat the same step for multiple files.
Open Tableau Prep Builder and click connect to data > To a File > Microsoft Excel.
Locate the Timesheets_APAC.xlsx file that you should have already downloaded and click Open.
On the blue bar on the left you should now see a table for each of the Asia Pacific countries.
Click the first table named Hong Kong and drag it onto the Add Data pane.
Once you drop the table onto the pane a grey Input step window should appear in the bottom half of the screen, this is where we set how to load the data in from our file.
On the left hand side of the grey window the tab for Multiple Files should have opened, if not click on this tab.
Select Wildcard union from the two options available.
The wildcard union is where you tell Tableau Prep how to find files with a similar name etc. so it can load them without you having to specify each individual file one by one. If you have a look at some of the options youâll see this is pretty flexible and can be used to locate files in a particular folder, that have a similar naming or follow a particular naming convention.
Note: We could use this to load the additional files we want but we want to load them in separate steps so for now weâre just going to use this to load in all of the individual country sheets from the APAC file.
Tableau Prep should have automatically found the five sheets within the file weâre loading, you donât need to change any options but you should see that itâs going to pull in all sheets from the file weâve specified, at the bottom you should see five sheets each with a different country name under the Include files / Include sheets list.
Click the blue Apply button
Rename the step you just created by right clicking the blue input step that appeared on the pane and clicking Rename.
Name it as âLoad APAC Fileâ, you can add a description as well if you want.
Once weâve created an input step we can then use the Tableau Prep functions to do a lot of cleaning, preparing and transforming our data. For this step weâre going to look at some basic cleaning options that removes data we donât want in our end file.
Click the blue + sign next to the step you just created .
Select Clean Step from the menu. that appears.
A new step should now have appeared in your data pane called Clean 1, the pane below should display the columns and data in your table, to begin with weâre just going to remove a couple of columns from this table.
Scroll across to the Tableau Names field and click it
Some options icons should now appear in the column header, click the 3 dots (âŚ) to bring up the columns menu
Click the Remove option.
Repeat the same with the File Paths column to remove that too.
This has now removed those columns that we donât need from our data without changing the underlying data source files (if you were to open them up you should see they havenât changed).
Tableau Prep allows you to see each of the clean steps youâve created and undo or edit them if you need to as well, this is a great way of being able to track what changes have been made.
Click the Changes side bar at the left to expand it.
You should see the 2 remove steps you just created in here, if you decided you wanted to keep these columns at a later date you can simply remove the steps by clicking them and then clicking the x on the right hand side (later on weâll see how we can edit certain steps as well but with a remove step thereâs nothing we can edit).
Rename the Clean 1 step to Clean APAC Input and weâre done with loading the APAC timesheet data and can move on to the next file.
Step 1.2 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.3 - Load and clean the Americas timesheets data
In this step weâre going to add in the Americas timesheet data and clean the hours booked data field so that it gives us the time as a number that we can use later on.
If the blue Connections bar isnât visible on the left click to expand it.
Click the + icon next to Connections at the top of the blue side panel.
Under To a File click Text file and locate the Timesheets_AMERICAS.csv file that you saved earlier.
This time we donât need to do anything with the input step so just rename the new input step thatâs appeared as Load Americas Data and add a new Clean step.
Notice that the Hours column is being brought in as a text string, this is because when the data has been entered itâs been entered as x hours. This means we canât do any calculations with the numbers so letâs clean this up.
Click the 3 dots at the top of the Hours column and hover over Clean >.
You should see that thereâs a few options for being able to quickly clean our data, theyâre all pretty self explanatory and weâll use some of the others later in the course.
For this step click Remove Letters.
Notice how Tableau Prep has gone through each entry and removed any letters for us?
We also had a space between the number and hours (you might not be able to see this immediately but if you double click one of the values in the column you should see the space at the end), weâll need to remove that before we can use the numbers,
Click the column menu and Clean > again
This time click Remove All Spaces.
We can now convert the columns format to numbers so that itâs the same as the fields from the other data sources.
Click the Abc icon at the top of the column
Select Number (decimal) from the menu that pops up.
That column should now have updated to a number format.
If you click the Changes tab again you can see the steps weâve just created. Notice that the first three have a pencil icon next to them, this means you can come back and edit those steps in the future if you need to without having to remove and re-create them.
Thatâs all we need to do with the Americas file so rename the Clean 2 step as Clean Americas Data and we can move onto importing our final timesheet file.
Step 1.3 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.4 - Load the EMEA timesheets data
In this step weâll follow the same process for loading our final timesheets file in from the EMEA region (EMEA stands for Europe, Middle East and Africa and covers countries in these regions), in this step we donât need to do any cleaning though so this will be a simple loading of the data.
Click the + sign in the blue Connections side bar
Click Text file under To a File again and this time locate the Timesheets_EMEA.csv file.
Rename the step as Load EMEA Data
Add a new Clean step and rename it as Clean EMEA Data.
We donât actually need to do any cleaning in this step but itâs good practice to add a clean step after an input step as this makes it a lot easier if you do need to do any cleaning of the file in the future or replace the input file.
Step 1.4 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.5 - Combine the three regions timesheets data into a single table
Weâve now loaded the timesheets data for each region and done any cleaning that is relevant for each specific file, in this step weâre going to combine them together into one table that we can then continue to clean and prepare.
Note: Combining the files into one table means you can just set up and do the required cleaning once rather than having to set it up for each individual file, this makes sense when the same steps need applying to data from multiple sources/files.
Click the Clean APAC Input step and drag it over the top of the Clean Americas Input step (Note: Minimise the grey panel to make it easier to work with the steps in the pane, you can expand it again when you need to do some work on an individual step).
A couple of options should appear as you drag the Clean APAC Input step over the Clean Americas Input step, keep dragging until the Union option is highlighted and then drop the Clean APAC Input here and a new Union 1 step should appear on your pane.
The union step will try to combine two data tables where they have the same fields, weâll look at the options for this in a minute but first we want to add the EMEA data to it.
Click the Clean EMEA Data step and drag this on top of the Union 1 step
This time drop it on the Add option when it appears.
The steps should now show a line from each of the Clean.. steps to the new Union step.
We use the Add option in the second step as weâre adding to the union step we already created, if we dropped it onto the Union option itâd created another new step which we donât want here.
Click the Union 1 step and make sure the grey info panel at the bottom of the screen is expanded.
With a union step Tableau Prep will create a combined table from the specified data tables, when the individual tables have the same column names and number format Tableau Prep will automatically combine them.
On the Settings tab that should be visible you should be able to see that Tableau Prep has been able to identify that the Employee, Details and Hours columns are the same in the Americas and EMEA data so has combined those columns.
It hasnât been able to determine the relevant columns from the APAC file as they have different names so we need to match them up manually
Note: An alternative option would be to rename the columns in the clean step so they all have the same naming, this is what weâd normally do but weâve left it this way so we can show you how to match columns manually here.
Under the Mismatched Fields first click Employee (The first square should be empty and the last two showing the colours assigned to the Americas/EMEA files).
Hover over Name and then click the + icon that should appear (The Employee field needs to be highlighted with a blue outline for this to appear).
This should now disappear as the fields have been matched across all three files.
Do the same with Details > Info and Time Booked (Hrs) > Hours to match the remaining fields.
Weâve now got a table with the timesheet data from all of our input files, rename the Union 1 step to Combine Timesheet Data.
Step 1.5 Video (Note: video has no sound, click the full screen button to see a larger version):
Step 1.6 - Finish cleaning and preparing the timesheet data
Our final step in this section is to finish off cleaning and preparing the timesheet data, as these steps apply to all of our input files we can do them on the combined data, if we hadnât combined out data weâd have to repeat this on each of our files/sheets.