Machine Learning Tip: Set Boundaries for the Problems
We cannot take a giant pile of unorganized
data, shove it into a machine, and expect useful results
Jonathan Bartlett
July 1, 2019
To succeed in machine learning, we must do a decent amount of
prep work. Just adding data, data, data can lead to false signals and invalid
correlations. We can end up missing the signal in all the noise.
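To make the point about false signals concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the data is pure random noise, and the function names, sample size, and seed are made up. It measures the strongest correlation that turns up by chance between a random "target" and a growing pile of unrelated random "features."

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_features, n_samples=20, seed=0):
    """Largest |correlation| between a noise target and n noise features."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is unrelated to the target by construction.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


few = strongest_chance_correlation(5)
many = strongest_chance_correlation(1000)
print(f"strongest chance correlation with 5 features:    {few:.2f}")
print(f"strongest chance correlation with 1000 features: {many:.2f}")
```

Because both runs share a seed, the 1000-feature run includes the same first five features, so its strongest chance correlation can only be equal or larger. None of these features means anything, yet the more of them we add, the more impressive the best "signal" looks.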
In “Why Machine Learning Works,” computer
scientist George Montañez walks the reader through
the prerequisites for successful machine learning. He notes that, at its core,
machine learning is a form of search algorithm. You are searching for a model
that matches your dataset.
The problem is that a search algorithm can only perform so many
searches. While computers may be incomprehensibly fast, they are not infinitely
fast. In fact, the size of many search spaces is so vast that they could not be
searched in the entire history of the universe. Therefore, to make machine
learning work, we must take steps to limit the machine’s search space.
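A rough back-of-the-envelope sketch shows how quickly an unrestricted search space outgrows any computer. The example is our own, not one taken from the paper: it counts every possible yes/no function of n binary inputs, and it assumes a deliberately generous rate of one billion candidates checked per second. The function names and the rate are hypothetical.

```python
def count_boolean_functions(n_inputs: int) -> int:
    """Number of distinct functions from n binary inputs to one binary output.

    There are 2**n possible input rows, and each row can map to 0 or 1,
    giving 2**(2**n) functions in total.
    """
    return 2 ** (2 ** n_inputs)


# Assumed rate for illustration: one billion candidates checked per second.
CHECKS_PER_SECOND = 10**9
# Approximate age of the universe (about 13.8 billion years) in seconds.
AGE_OF_UNIVERSE_SECONDS = int(13.8e9 * 365.25 * 24 * 3600)


def universe_lifetimes_to_enumerate(n_inputs: int) -> float:
    """How many ages of the universe a brute-force enumeration would take."""
    candidates = count_boolean_functions(n_inputs)
    return candidates / (CHECKS_PER_SECOND * AGE_OF_UNIVERSE_SECONDS)


for n in (3, 5, 8):
    print(n, count_boolean_functions(n), universe_lifetimes_to_enumerate(n))
```

With 3 inputs there are only 256 candidate functions. With 8 inputs there are 2^256, which at the assumed rate would take vastly longer than the age of the universe to enumerate. Real datasets have far more than eight binary inputs, which is why the search has to be constrained before it starts.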
The first part of this process is data selection. Humans
intrinsically understand causation and therefore know which pieces of data
are likely to be related. So when we select data for computers to
analyze, we drastically reduce the size of the problem they must solve.
For instance, while you might ask the computer to analyze the
relationship between weather patterns and crop performance, you wouldn’t ask it
to analyze the relationship between football team performance and crop
performance, or TV lineups and crop performance.
That might seem like a trivially obvious point. What may not be
obvious is how strongly basic data selection affects a model’s ability to find
patterns: every irrelevant variable adds candidate relationships that must be
searched and ruled out.
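As a small illustration of how selection shrinks the problem, the sketch below counts how many different combinations of predictor columns a machine could try when explaining crop yield. The column names are hypothetical stand-ins for the examples above, and counting subsets is just one simple way to measure the size of the search.

```python
def count_predictor_subsets(columns, target):
    """Number of non-empty subsets of predictor columns for a given target."""
    predictors = [c for c in columns if c != target]
    # Each predictor is either in or out of a subset; drop the empty subset.
    return 2 ** len(predictors) - 1


# Hypothetical column names, for illustration only.
everything = ["rainfall", "temperature", "football_scores", "tv_lineup", "crop_yield"]
selected = ["rainfall", "temperature", "crop_yield"]

print(count_predictor_subsets(everything, "crop_yield"))  # 15
print(count_predictor_subsets(selected, "crop_yield"))    # 3
```

Dropping two irrelevant columns cuts the candidate combinations from 15 to 3. The count doubles with every column added, so with hundreds of unvetted columns the difference is no longer a convenience but a necessity.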
Additionally, the way that the data is represented is important.
If you are representing text to a computer, you have several options to choose
from. It can be an image of letters on paper, a word processing document, a
string of letters, or a string of words. Or it can be preprocessed to give
specific meanings to groups of words. Depending on what you are looking for,
different representations make more sense. This choice of representation is
another way that humans reduce the search space to enable machine learning.
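Here is a brief sketch of three of those representations applied to the same sentence, using only the Python standard library. The sentence and variable names are made up for illustration, and word counts stand in for the more elaborate preprocessing mentioned above.

```python
from collections import Counter

# Made-up example sentence.
text = "the rain fell and the crops grew"

# Representation 1: a string of letters (spaces included).
as_characters = list(text)

# Representation 2: a string of words.
as_words = text.split()

# Representation 3: preprocessed word counts, which discard word order.
as_word_counts = Counter(as_words)

print(len(as_characters))     # 32 symbols to reason about
print(len(as_words))          # 7 tokens
print(as_word_counts["the"])  # 2
```

Each step throws information away on purpose. The character view forces a model to rediscover what a word is, the word view hands it that structure for free, and the count view assumes that order does not matter for the question being asked. Which one is right depends on what we are looking for.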
Once the datasets are selected and their format for the computer
is established, the machine learning algorithm must be selected. This choice
depends on how the user expects the variables in the data to relate to one
another. Will the user choose a neural network? A decision tree? A Bayesian
network? Something more specialized? Like the earlier steps, this choice of
model significantly reduces the total amount of search that a machine learning
algorithm must perform.
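To see learning as search in miniature, the sketch below commits in advance to one model family, a straight line, and then simply tries every candidate in a small grid. The data, the grid bounds, and the function names are all assumptions made for illustration. Practical algorithms search far more cleverly than this, but the principle is the same.

```python
from itertools import product

# Toy data, made up for illustration: it follows y = 2x + 1 exactly.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]


def squared_error(slope, intercept):
    """Total squared error of the line y = slope * x + intercept on the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


# Choosing "a line with integer slope and intercept between -5 and 5"
# leaves only 11 * 11 = 121 candidate models to search.
candidates = list(product(range(-5, 6), repeat=2))

best_slope, best_intercept = min(candidates, key=lambda p: squared_error(*p))
print(len(candidates), best_slope, best_intercept)  # 121 2 1
```

The search succeeds after 121 checks only because we assumed the answer was a line, and it is one. Had the data followed a curve, the same search would still have returned its best line, just a poor one.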
All of these operations can be grouped under one category—parameterization. We
are “parameterizing” the machine learning problem. We take an abstract problem
and establish parameters by which the search will be done. Each
parameterization step reduces the amount of search that the machine learning
algorithm must perform. Successful parameterization happens when the user’s
choice of parameters matches the reality of the situation, making the algorithm
faster.
In the end, the success of machine learning projects depends on
the ability of users to properly parameterize their data so that a machine can
search out patterns in it efficiently. Otherwise, the search space of possible
events is simply too large to process.