January 11, 2019

Garbage In, Garbage Out: How to Prepare Your Data Set for Machine Learning

In the world of artificial intelligence, computer scientists juggle many different acronyms: AI for artificial intelligence, ML for machine learning, DL for deep learning and even CS for computer science itself. These commonly used and often linked terms all share the common thread of using data to build machines that are smarter, more efficient and more capable than ever before.

But for computers to take full advantage of AI capabilities, there’s another acronym that computer scientists must know to build successful machines: GIGO, short for “garbage in, garbage out.”

For artificial intelligence, this means the quality of the output depends on the quality of the input. With bad data, applications with AI capabilities, such as chatbots or personal assistants, will produce results that are inaccurate, incomplete or incoherent. Having good data is especially important for AI subsets like machine learning and deep learning, which gain greater capabilities over time by analyzing large sets of data, learning from them and ultimately making adjustments that make the applications more intelligent.

Clean data and machine learning algorithms help companies such as Netflix and Amazon streamline processes and increase revenue.

Before feeding your data set into a machine learning application, you must ensure your data is accurate, consistent and useful enough for the model to learn from.

Here are the steps you should take to make sure you’ve properly prepared your data before using it for machine learning purposes:

1. Pre-process your data

Data can be gathered from any number of sources, and with that comes the possibility that your data isn’t complete or fully accurate. To ensure your data is high quality, and therefore useful, it needs to be pre-processed before being used in a model. Otherwise, you’ll be following the practice of putting “garbage” in.

To begin pre-processing, identify any data sets that need to be cleaned. You can perform a data health check by checking the following elements:

  • The number of records and attributes your data contains
  • The attribute data types your data contains, and whether they are nominal, ordinal or continuous
  • Data that is missing or incomplete, including data that is lacking specific values, entries or necessary attributes
  • Inconsistent data containing conflicting records or records that don’t include the proper data ranges
  • Noisy data containing inaccurate, incorrect or unnecessary records
  • The well-formedness of your data in each particular file format. For data in CSV or TSV files, ensure column and line separators are correctly separating columns and lines; for HTML or XML data, ensure the data follows each format’s specific standards. Semi-structured or unstructured data may require additional parsing to extract a structured data set.
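
As a rough illustration, the health check above could be sketched in Python using only the standard library. The sample data, the `health_check` function name and the keys of the report are hypothetical choices made for this example, and counting non-numeric values is only a crude hint at attribute types, not a full type analysis.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; in practice you would read your own CSV or TSV file.
SAMPLE_CSV = """id,age,city
1,34,Boston
2,,Denver
3,29
4,abc,Austin
"""

def health_check(text, delimiter=","):
    """Summarize records, attributes, missing values and malformed rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, records = rows[0], rows[1:]
    report = {
        "records": len(records),
        "attributes": header,
        "ragged_rows": [],         # 1-based record numbers whose width differs from the header
        "missing": Counter(),      # attribute -> number of empty or absent values
        "non_numeric": Counter(),  # attribute -> number of values that do not parse as numbers
    }
    for number, record in enumerate(records, start=1):
        # Well-formedness: every record should have as many fields as the header.
        if len(record) != len(header):
            report["ragged_rows"].append(number)
        for index, name in enumerate(header):
            value = record[index].strip() if index < len(record) else ""
            if value == "":
                report["missing"][name] += 1
                continue
            try:
                float(value)
            except ValueError:
                report["non_numeric"][name] += 1
    return report

report = health_check(SAMPLE_CSV)
print(report)
```

In this sample, an attribute such as `age` that mixes numeric and non-numeric values points to inconsistent or noisy records, while a ragged row suggests a separator problem in the file itself.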

If your data health check reveals data sets with issues, you’ll need to process your data further to make it usable and useful.

2. Clean and process your data

Data cleaning must be carried out whenever you’ve identified potential issues with your data set. Models trained on dirty, incomplete, noisy or otherwise “garbage” data learn from bad examples, so the results they produce won’t be accurate or complete. Here are the steps to take when performing a data cleanse:

  • Clean your data by taking incomplete data sets and filling in missing values or removing them altogether, along with removing noisy data and outliers.
  • Clean your text by identifying and repairing issues with the text that may cause data to become misaligned, such as embedded special characters, tabs or line breaks.
  • Transform and reduce data. Too much data can get in the way of efficient processing. Transforming data sets and removing sample records or attributes can help put together a leaner and more efficient data set.
  • Normalize data by ensuring that each variable has its own column, each observation has its own row and each value has its own cell. For each variable, define whether it is fixed or measured.
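
The first step in the list, filling in missing values and removing outliers, might look like the following minimal sketch. Filling with the median and dropping values outside 1.5 times the interquartile range are assumptions chosen for illustration, not the only reasonable strategies; the function names and the sample `ages` list are hypothetical.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def drop_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

# Hypothetical attribute with one missing value and one implausible entry.
ages = [31, 29, None, 35, 33, 30, 240, 28]
filled = fill_missing(ages)
cleaned = drop_outliers(filled)
print(cleaned)
```

Whether to fill a missing value or drop the record depends on how much data you can afford to lose and on how the gap arose, so treat the thresholds here as starting points to tune.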

With spreadsheets that contain hundreds of thousands of entries, the data cleansing process can take a significant amount of time. But if you neglect to ensure your data is clean, useful and easy to process, low-quality results will hamper your machine learning efforts.
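
Part of that effort can be automated. As one possible sketch of the text-cleaning step above, the function below replaces embedded tabs and line breaks with spaces and strips control characters from a single field. Treating “special characters” as Unicode control and format characters is an assumption made for this example, and `clean_text` is a hypothetical helper name.

```python
import re
import unicodedata

def clean_text(value):
    """Remove characters from one field that tend to misalign columns."""
    # Tabs and line breaks inside a field are a common cause of shifted columns.
    value = re.sub(r"[\t\r\n]+", " ", value)
    # Drop remaining control and format characters (Unicode categories Cc and Cf).
    value = "".join(
        ch for ch in value if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" {2,}", " ", value).strip()

raw = "Acme\tCorp\r\n  \u200bBoston\x00 office "
cleaned = clean_text(raw)
print(repr(cleaned))
```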

3. Bring clean data to life

Once your data has been cleansed, it’s ready for use in data analysis. Machine learning algorithms get the most out of clean data sets when carrying out tasks such as data visualization, modeling and organization.

Clean data ensures data-dependent tasks won’t produce “garbage” visuals, models and organization.

4. Follow data collection best practices

One of the best ways to ensure your data is clean is to collect clean data from the get-go. When putting together a new data set, follow data collection best practices to save time and prevent the need for future data cleansing work:

  • Ensure the data you’re gathering is necessary and useful.
  • Eliminate categories, attributes or values that may be duplicated elsewhere.
  • Use standard formatting for characters, tabs and spacing.

Take time to verify that data collection best practices are being followed. Periodically check your data sets to correct any bad entries, and adjust data collection to ensure it’s not producing the wrong results.
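
One way to enforce these practices is to validate each record as it is collected, before it ever reaches the data set. The sketch below is only illustrative: the field names (`id`, `email`, `age`), the 0 to 120 age range and the `validate_record` function are all hypothetical and would need to match your own schema.

```python
def validate_record(record, seen_ids):
    """Return a list of problems found in one incoming record."""
    problems = []
    # Required attributes: collect only what is necessary, but insist it is present.
    for field in ("id", "email", "age"):
        if not str(record.get(field, "")).strip():
            problems.append(f"missing {field}")
    # Duplicates: flag an id that has already been collected.
    if record.get("id") in seen_ids:
        problems.append("duplicate id")
    # Range check: ages outside 0-120 are treated as bad entries (assumed range).
    age = record.get("age")
    if isinstance(age, (int, float)) and not 0 <= age <= 120:
        problems.append("age out of range")
    return problems

seen = set()
accepted, rejected = [], []
incoming = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "", "age": 230},
]
for record in incoming:
    issues = validate_record(record, seen)
    if issues:
        rejected.append((record, issues))
    else:
        seen.add(record["id"])
        accepted.append(record)

print(len(accepted), "accepted;", len(rejected), "rejected")
```

Reviewing the rejected records periodically shows whether the collection process itself needs adjusting, rather than only the individual entries.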

The principle of “garbage in, garbage out” serves as a useful reminder that in the world of machine learning, quality is everything. Considering the vast amounts of data machine learning algorithms are tasked with processing, leaving “garbage” at the curb is essential to building applications that serve useful and specific functions. By following data collection best practices and thoroughly cleaning existing sets of data, you’ll help ensure your machine learning tools operate as intelligently as the person who took the time to care for their information.

Need professional assistance to build machine learning algorithms and make better use of your data? Tell us about your project and our skilled AI specialists will translate your ideas into efficient AI solutions to solve your business tasks.