Cleaning Data with Refine

This tutorial is an adaptation of the EcoLab's GeoJournalism Handbook. The original material can be found here

OpenRefine (previously Google Refine) is a data-cleaning tool that uses your web browser as its interface. It may look as if it runs on the internet, but all your data remains on your machine and you do not need an internet connection to work with it.

The main aim of Refine is to help you explore and clean your data before you use it further. It is built for large datasets, so don’t worry: as long as your spreadsheet can hold the information, Refine can as well.

Creating a new Project

To work with your data in Refine you need to start a new project.

Walkthrough: Creating a Refine project

  1. Start Refine. This opens a browser window pointing to http://127.0.0.1:3333. If this doesn’t happen, open the link in your browser directly.
  2. Create a new project: on the left, select the “Create Project” tab.

You have now successfully created your first Refine project. Remember: although it runs in a web browser, the Refine server is still on your machine, and all your data stays there (so no worries if you handle sensitive information).
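
If the browser window does not open by itself, it can help to check whether anything is answering on the local address at all. The following is only a minimal sketch in Python, assuming the default address from the walkthrough above; the helper names `refine_url` and `is_listening` are made up for this illustration and are not part of Refine.

```python
import socket

# Default address taken from the walkthrough; change it if your setup differs.
REFINE_HOST = "127.0.0.1"
REFINE_PORT = 3333


def refine_url(host=REFINE_HOST, port=REFINE_PORT):
    """Build the address to open in the browser."""
    return f"http://{host}:{port}"


def is_listening(host=REFINE_HOST, port=REFINE_PORT, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is reachable there.
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is running locally, open", refine_url())
    else:
        print("Nothing is listening on", refine_url())
```

Because the address is 127.0.0.1, the check never leaves your own machine, which matches the point above that your data stays local.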

Sorting and Faceting

Now that we have created our project, let’s explore the data and the Refine interface a bit. Refine might be intimidating at first because it seems so different from a spreadsheet, but once you get used to it you will notice how easily you can do things with it.

Two of the most commonly used functions in spreadsheets are sorting and filtering data, for example to find minima and maxima or to learn about certain categories. Refine can do both.

Walkthrough: Sorting rows

  1. Refine handles data much like a spreadsheet: you have rows, columns and cells. A cell is a field defined by a row and a column.
  2. To sort your rows based on a specific column, click on the small downward triangle next to the column name and choose “Sort”.
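
To make the idea of sorting rows by one column concrete, here is a small sketch in plain Python. It is not how Refine works internally; the sample rows, the column names `City` and `Value`, and the helper `sort_rows` are all hypothetical and only stand in for your own data.

```python
# Hypothetical sample rows; the tutorial's real dataset is not reproduced here.
rows = [
    {"City": "LA PAZ", "Value": 12},
    {"City": "COCHABAMBA", "Value": 7},
    {"City": "SUCRE", "Value": 30},
]


def sort_rows(rows, column, descending=False):
    """Return a new list of rows ordered by the given column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


by_value = sort_rows(rows, "Value")

# After sorting, the minimum is the first row and the maximum is the last.
smallest, largest = by_value[0], by_value[-1]
print("smallest:", smallest["City"], smallest["Value"])
print("largest:", largest["City"], largest["Value"])
```

As in Refine, the rows themselves stay intact; only their order changes.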

The other frequently used function in spreadsheets is filtering. In Refine this is called faceting. Faceting in Refine is really powerful: we will use facets in most of the rest of this recipe.

Walkthrough: Faceting rows based on a column

  1. Select the column options for the column you want to facet with
  2. Select “Facet”
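
Conceptually, a text facet lists every distinct value in a column together with how many rows contain it, and selecting one value keeps only the matching rows. The sketch below imitates that in plain Python under stated assumptions: the sample values and the helpers `text_facet` and `select` are hypothetical, and Refine's own implementation may differ.

```python
from collections import Counter

# Hypothetical values of a "City" column, one entry per row.
cities = ["LA PAZ", "SUCRE", "LA PAZ", "COCHABAMBA", "SUCRE", "LA PAZ"]


def text_facet(values):
    """Count how often each distinct value occurs in the column."""
    return Counter(values)


def select(values, choice):
    """Keep only the entries equal to the chosen facet value."""
    return [value for value in values if value == choice]


facet = text_facet(cities)
for value, count in facet.most_common():
    print(value, count)

print(len(select(cities, "SUCRE")), "rows match SUCRE")
```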

Reconciling Columns

Sometimes humans make mistakes when they enter data: they mistype city names or put in characters they cannot see but the computer can. (For example, a single space added at the end of a name is enough for the computer to treat it as a different name.) To spot such problems, let’s create a text facet for the cities:

Walkthrough: Reconciling Columns

  1. Create a text facet for the City column
  2. Scroll down to where it says La Paz: see how many different ways there are to write “La Paz”?
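
Why does the computer see so many versions of one city? The sketch below shows the effect in plain Python and one simple way to group the variants: trim and collapse whitespace, then ignore case. The list of variants and the helper `normalise` are hypothetical examples, and this is a simplified stand-in rather than the method Refine itself uses.

```python
from collections import defaultdict

# Hypothetical variants of the kind a text facet might reveal.
variants = ["LA PAZ", "La Paz", "LA PAZ ", "la paz", " La  Paz"]


def normalise(name):
    """Collapse runs of whitespace, trim the ends and ignore case."""
    return " ".join(name.split()).casefold()


# To the computer, an invisible trailing space makes a different value.
trailing_space_matters = "LA PAZ" != "LA PAZ "

# Group the raw spellings under their normalised form.
groups = defaultdict(list)
for name in variants:
    groups[normalise(name)].append(name)

for key, spellings in groups.items():
    print(key, "<-", spellings)
```

All five spellings end up under one key, which is the kind of merge you would then apply to the column.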

Making city names look nice

Did you notice how most of the city names are all uppercase? Names are rarely written like this, and you may want nicer-looking ones. No problem: Refine supports this.

Walkthrough: Changing Case in Refine

  1. Let’s change the case in our city column from all uppercase to title case.
  2. To do this, open the column options and go to Edit cells -> Common transforms -> To titlecase.
  3. Tada: your names have been converted.
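
For comparison, this is roughly what a title-case transform does, sketched with Python's built-in `str.title()`. It is an approximation for illustration only; Refine's own transform may treat edge cases differently, and the sample names and the helper `to_titlecase` are hypothetical.

```python
# Hypothetical uppercase names, as they might appear in the City column.
cities = ["LA PAZ", "SUCRE", "SANTA CRUZ DE LA SIERRA"]


def to_titlecase(name):
    """Capitalise the first letter of each word and lowercase the rest."""
    return name.title()


converted = [to_titlecase(city) for city in cities]
for before, after in zip(cities, converted):
    print(before, "->", after)
```

Note that this simple rule also capitalises small words such as “de” and “la”, so names like these may still need a manual touch-up.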

Congratulations! You successfully cleaned up a dataset using Refine!

 

About the Author
Michael Bauer lives in Vienna and works as a Data Wrangler with the Open Knowledge Foundation, mostly around the School of Data. After a detour in biomedical research, where he learned to love data wrangling, he spent some time doing advocacy for his passion: freedom in the digital age. He joined the Open Knowledge Foundation to satisfy his curiosity. He will gladly jump on any topic that you point him to. See more at: http://okfn.org/about/team/#Michael_Bauer_8212_Data_Wrangler_School_of_Data
