Cleaning Data with RefineMichael Bauer | 24 September 2013
Open Refine (previously Google Refine) is a data cleaning software that uses your web browser as an interface. This means it will look like it runs on the internet but all your data remains on your machine and you do not need internet connection to work with it.
The main aim of Refine is to help you exploring and cleaning your data before you use it further. It is built for large datasets – so don’t worry as long as your spreadsheets can hold the information, refine can as well.
To work with your data in Refine you need to start a new project.
Walkthrough: Creating a Refine project
- Start Refine – this will open a browser window pointing to http://127.0.0.1:3333 if this doesn’t happen open the link with your browser directly
- Create a new project: On the left tab select the “Create Project” tab:
You now have successfully created your first refine project. Remember: although it runs in a web-browser, the Refine server is still on your machine – all the data is there (so no worries if you handle sensitive information)
Once we created our project, let’s go and explore the data and the Refine interface a bit. Using Refine might be intimidating at first, since it seems so different from spreadsheets, once you get used to it you will notice how easily you can do things with it.
One of the commonly used functions in spreadsheets is sorting and filtering data – to figure out minima, maxima or things about certain categories. Refine can do the same thing.
Walkthrough: Sorting rows
- Refines handles data similar to a spreadsheet: you have rows, columns and cells – a cell is a field defined by a row and a column.
- To sort your rows based on a specific column click on the small downward triangle next to the column.
The other frequently used function in Spreadsheets is filtering – in Refine this is called facetting. Facetting in Refine is really powerful – you will see in most of the rest of the Recipe we’ll use facets.
Walkthrough: Facetting rows based on a column
- Select the column options for the column you want to facet with
Sometimes humans make mistakes when they enter data – they mistype city names or put in characters they can not see but the computer can. (For example, you can add a simple space at the end of a name and the computer will think they are different). For this let’s create a text facet for the cities:
Walkthrough: Reconciling Columns
- Create a text facet for the City column
Did you notice how most of the Cities are all uppercase? It’s rare to read them like this. And maybe you want to have nicer looking names: No problem. Refine supports this.
Walkthrough: Changing Case in Refine
- Let’s change the case in our city column from all uppercase to titlecase
- To do this, open the column options, go to edit cells -> common transforms -> to titlecase
- Tada – magically your names have been converted.
Congratulations! You successfully cleaned up a dataset using Refine!