Sunday, July 7, 2019

Machine Learning Tip: Set Boundaries for the Problems

We cannot take a giant pile of unorganized data, shove it into a machine, and expect useful results.
Jonathan Bartlett 
July 1, 2019
To succeed in machine learning, we must do a decent amount of prep work. Just adding data, data, data can lead to false signals and invalid correlations. We can end up missing the signal in all the noise.
In “Why Machine Learning Works,” computer scientist George Montañez walks the reader through the prerequisites for successful machine learning. He notes that, at its core, machine learning is a form of search algorithm. You are searching for a model that matches your dataset.
The problem is that a search algorithm can only examine so many candidates. While computers may be incomprehensibly fast, they are not infinitely fast. In fact, many search spaces are so vast that they could not be searched exhaustively in the entire history of the universe. Therefore, to make machine learning work, we must take steps to limit the machine’s search space.
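To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It counts every possible yes/no rule over n yes/no inputs (there are 2^(2^n) of them) and compares that count with a deliberately generous search budget. The checking rate of 10^18 candidates per second is an assumption chosen for illustration, not a measured figure, and the age of the universe is rounded to roughly 13.8 billion years.

```python
# Assumption: the universe is about 13.8 billion years old (~4.35e17 seconds).
AGE_OF_UNIVERSE_SECONDS = 435 * 10**15
# Assumption: a hypothetical machine that checks 10^18 candidates per second.
CHECKS_PER_SECOND = 10**18


def count_boolean_functions(n_inputs: int) -> int:
    """Number of distinct yes/no rules over n_inputs yes/no inputs."""
    # A rule assigns an output to each of the 2^n input combinations,
    # so there are 2^(2^n) possible rules.
    return 2 ** (2 ** n_inputs)


def max_candidates_checked() -> int:
    """Candidates the hypothetical machine could check since the Big Bang."""
    return AGE_OF_UNIVERSE_SECONDS * CHECKS_PER_SECOND


def exhaustively_searchable(n_inputs: int) -> bool:
    """True if every rule over n_inputs inputs fits within the budget."""
    return count_boolean_functions(n_inputs) <= max_candidates_checked()


for n in range(1, 11):
    print(n, exhaustively_searchable(n))
```

Under these assumptions the budget runs out somewhere between six and seven yes/no inputs, which is far smaller than any realistic dataset.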
The first part of this process is data selection. Humans intrinsically understand causation and therefore know which pieces of data are likely to be related. When we select data for computers to analyze, we are drastically reducing the size of the problem for computers.
For instance, while you might ask the computer to analyze the relationship between weather patterns and crop performance, you wouldn’t ask it to analyze the relationship between football team performance and crop performance, or TV lineups and crop performance.
That might seem like a trivially obvious point. What may not be obvious is the impact of basic data selection on the ability of models to find patterns in the data.
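One way to see that impact is a small simulation, sketched here under stated assumptions. Every column is pure random noise, so any correlation with the target is a false signal. The function names, the 10 rows, and the column counts are illustrative choices and do not come from Montañez’s chapter.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_columns, n_rows=10, seed=0):
    """Largest |correlation| between a random target and random columns.

    Nothing here is causally related to anything else, so whatever
    correlation turns up is produced by chance alone.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(column, target)))
    return best


# A handful of columns versus a giant pile of them.
print(strongest_chance_correlation(5))
print(strongest_chance_correlation(2000))
```

Because both runs share a seed, the larger run contains the same first five columns, so its strongest chance correlation can only be equal or higher. The more irrelevant columns we hand the machine, the more convincing the best false signal looks.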
Additionally, the way that the data is represented is important. If you are representing text to a computer, you can choose from a number of ways. It can be an image of letters on paper, a word processing document, a string of letters, or a string of words. Or it can be preprocessed to give specific meanings to groups of words. Depending on what you are looking for, different representations make more sense. This choice of representation is another way that humans reduce the search space to enable machine learning.
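As a minimal sketch, here is one short sentence in three of those representations. The sentence and the function names are invented for illustration, and word counts are only one simple stand-in for "preprocessed" text.

```python
import re
from collections import Counter

text = "Rain helps crops. Rain delays harvest."


def as_characters(text):
    """A string of letters: every character, including spaces and periods."""
    return list(text)


def as_words(text):
    """A string of words: lowercase alphabetic tokens, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def as_word_counts(text):
    """A preprocessed form: how often each word occurs, ignoring order."""
    return Counter(as_words(text))


print(len(as_characters(text)), "characters")
print(as_words(text))
print(as_word_counts(text).most_common(1))  # → [('rain', 2)]
```

Each step hands the machine fewer, more meaningful pieces. It also throws information away (capitalization, punctuation, word order), which is why the right representation depends on what you are looking for.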
Once the datasets are selected and their format for the computer is established, the machine learning algorithm must be selected. This choice, again, depends on how the user expects the variables in the data to relate to one another. Will the user choose a neural network? A decision tree? A Bayesian network? Something more specialized? Again, this choice of model significantly reduces the total amount of search that a machine learning algorithm must perform.
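To make the reduction concrete, the sketch below uses a deliberately tiny model family: rules that look at a single yes/no input, or its opposite. This is a toy stand-in for a real decision tree, and the names `stump_hypotheses` and `fit_stump` are hypothetical. With three inputs there are 256 possible yes/no rules in total, but this family contains only six of them.

```python
from itertools import product


def stump_hypotheses(n_inputs):
    """Yield (name, predict) for every one-input rule and its negation."""
    for i in range(n_inputs):
        yield (f"x[{i}]", lambda x, i=i: x[i])
        yield (f"not x[{i}]", lambda x, i=i: 1 - x[i])


def fit_stump(rows, labels, n_inputs):
    """Try every rule in the family and keep the one with the most hits."""
    best_name, best_correct = None, -1
    for name, predict in stump_hypotheses(n_inputs):
        correct = sum(predict(x) == y for x, y in zip(rows, labels))
        if correct > best_correct:
            best_name, best_correct = name, correct
    return best_name, best_correct


n_inputs = 3
rows = list(product([0, 1], repeat=n_inputs))  # all 8 input combinations
labels = [x[1] for x in rows]                  # the "truth" copies input 1

print(2 ** (2 ** n_inputs))                     # all possible rules → 256
print(len(list(stump_hypotheses(n_inputs))))    # rules in this family → 6
print(fit_stump(rows, labels, n_inputs))        # → ('x[1]', 8)
```

The search finishes after six candidates instead of 256. The catch is that it can only ever return one of those six.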
All of these operations can be grouped under one category: parameterization. We are “parameterizing” the machine learning problem. We take an abstract problem and establish the parameters within which the search will be done. Each parameterization step reduces the amount of search that the machine learning algorithm must perform. Parameterization succeeds when the user’s choice of parameters matches the reality of the situation, so that the algorithm can find a good model quickly.
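The sketch below illustrates what "matches the reality" can mean, using invented rainfall numbers. The search is limited to rules of the form "good harvest if rainfall is at least t." That limit works perfectly when more rain really is better, and it cannot succeed when moderate rain is best, no matter how long the search runs. The data and the function name are hypothetical.

```python
def best_threshold_rule(rainfall, good_harvest):
    """Search only rules of the form 'good harvest if rainfall >= t'.

    Returns the best threshold t and how many rows it gets right.
    """
    best_t, best_correct = None, -1
    for t in sorted(set(rainfall)):
        correct = sum((r >= t) == g for r, g in zip(rainfall, good_harvest))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct


rainfall = [10, 20, 30, 40, 50, 60]  # made-up measurements

# Reality A: more rain is always better. The rule family matches it.
more_is_better = [False, False, False, True, True, True]
print(best_threshold_rule(rainfall, more_is_better))    # → (40, 6)

# Reality B: moderate rain is best. No single threshold can describe it.
moderate_is_best = [False, False, True, True, False, False]
print(best_threshold_rule(rainfall, moderate_is_best))  # → (30, 4)
```

In the second case the search is just as fast, but the best it can do is four rows out of six, because the parameters we chose rule out the true pattern.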
In the end, the success of a machine learning project depends on the ability of its users to parameterize the problem properly, so that a machine can search for patterns in the data efficiently. Otherwise, the space of possibilities is simply too large to search.