Machine Learning For Dummies. John Paul Mueller

Statistics, one of the methods of machine learning that you consider in this book, is a method of describing problems using math. By combining big data with statistics, you can create a machine learning environment in which the machine considers the probability of any given event. However, saying that statistics is the only machine learning method is incorrect. This chapter also introduces you to the other forms of machine learning currently in place.

Before an algorithm can do much in the way of machine learning, you must train it. The training process modifies how the algorithm views big data. It’s essential to understand that training uses only a subset of the data. From that subset, the algorithm builds the patterns it needs to recognize specific cases from the more general cases that you provide as part of the training.
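
The idea of training on a subset can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the function name split_dataset, the 80 percent training fraction, the fixed seed, and the practice of holding the remaining records back for later checking are all assumptions made for the example.

```python
import random

def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of records and split it into training and held-out subsets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: 100 numbered records standing in for real observations.
records = list(range(100))
train, held_out = split_dataset(records)
print(len(train), len(held_out))
```

Shuffling before the cut keeps the training subset from simply mirroring the order in which the data was collected.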

      Big data is substantially more than just a large database. Yes, big data implies lots of data, but it also includes the idea of complexity and depth. A big data source describes something in enough detail that you can begin working with that data to solve problems for which general programming proves inadequate.

      As an example of big data complexity, consider Google’s self-driving cars (https://waymo.com/). The car must consider not only the mechanics of the car’s hardware and its position in space but also the effects of human decisions, road conditions, environmental conditions, and other vehicles on the road, which is why our roads aren’t crowded with them yet (see https://www.vox.com/future-perfect/2020/2/14/21063487/self-driving-cars-autonomous-vehicles-waymo-cruise-uber). It’s not hard to imagine some of the human-specific issues that self-driving cars will need to address, such as people taking a nap when they should be watching the road even with the self-driving car in control (https://robbreport.com/motors/cars/canadian-police-arrest-sleeping-driver-tesla-autopilot-1234570071/).

      The data source for a self-driving car (or any other complex endeavor for that matter) contains many variables — all of which affect the vehicle in some way. Traditional programming might be able to crunch all the numbers, but not in real time. You don’t want the car to crash into a wall and have the computer finally decide five minutes later that the car is going to crash into a wall. The processing must prove timely so that the car can avoid the wall.

      The acquisition of big data can also prove daunting. The sheer bulk of the dataset isn’t the only problem: you must also consider how the dataset is stored and transferred so that the system can process it. In most cases, developers try to store the dataset in memory to allow fast processing. Using a hard drive to store the data would prove too costly, time-wise.

      JUST HOW BIG IS BIG?

      Big data can really become quite big. For example, suppose that your Google self-driving car has a few HD cameras and a couple hundred sensors that provide information at a rate of 100 times per second. What you might end up with is a raw dataset with input that exceeds 100 Mbps. Processing that much data is incredibly hard.
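
A back-of-the-envelope calculation shows how such a data rate could arise. The sketch below is only an illustration under loudly stated assumptions: the text supplies "a couple hundred sensors" and the rate of 100 readings per second, while the bytes per reading, the camera count, the 1280x720 resolution, the 30 frames per second, and the choice of uncompressed video are guesses made for the example.

```python
# From the text: a couple hundred sensors, each read 100 times per second.
SENSOR_COUNT = 200
SAMPLES_PER_SECOND = 100

# Assumptions (not from the text):
BYTES_PER_READING = 4        # one 32-bit value per sensor reading
CAMERA_COUNT = 3             # "a few" HD cameras
FRAME_WIDTH, FRAME_HEIGHT = 1280, 720
BYTES_PER_PIXEL = 3          # uncompressed 24-bit color
FRAMES_PER_SECOND = 30

def to_mbps(bytes_per_second):
    """Convert bytes per second to megabits per second (1 Mb = 1,000,000 bits)."""
    return bytes_per_second * 8 / 1_000_000

sensor_bytes = SENSOR_COUNT * SAMPLES_PER_SECOND * BYTES_PER_READING
camera_bytes = (CAMERA_COUNT * FRAME_WIDTH * FRAME_HEIGHT
                * BYTES_PER_PIXEL * FRAMES_PER_SECOND)

sensors = to_mbps(sensor_bytes)
cameras = to_mbps(camera_bytes)
total = sensors + cameras
print(f"sensors: {sensors:.2f} Mbps, cameras: {cameras:.2f} Mbps, total: {total:.2f} Mbps")
```

Under these assumptions the sensors contribute well under 1 Mbps, and the cameras account for nearly all of the input. Compressed video would lower the total considerably, so treat the result as an order-of-magnitude estimate only.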

      Part of the problem right now is determining how to control big data. Currently, the usual approach is to log everything, which produces a massive, detailed dataset. However, this dataset isn’t well formatted, which again makes it quite hard to use. As this book progresses, you discover techniques that help control both the size and the organization of big data so that the data becomes useful in making predictions.

When thinking about big data, you must also consider anonymity. Big data presents privacy concerns. However, because of the way machine learning works, knowing specifics about individuals isn’t particularly helpful anyway. Machine learning is all about determining patterns: analyzing training data in such a manner that the trained algorithm can perform tasks that the developer didn’t originally program it to do. Personal data has no place in such an environment.
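
One simple way to act on this point is to drop fields that identify individuals before the data reaches the training step. The following is a minimal sketch, not a complete anonymization method: the field names, the sample records, and the helper strip_personal_fields are all hypothetical.

```python
# Hypothetical field names; which fields count as personal depends on the dataset.
PERSONAL_FIELDS = frozenset({"name", "email", "phone"})

def strip_personal_fields(record, personal_fields=PERSONAL_FIELDS):
    """Return a copy of record without the fields listed as personal."""
    return {key: value for key, value in record.items()
            if key not in personal_fields}

# Made-up records for illustration only.
raw_rows = [
    {"name": "A. Example", "email": "a@example.com", "age_band": "30-39", "purchases": 12},
    {"name": "B. Example", "email": "b@example.com", "age_band": "40-49", "purchases": 3},
]

training_rows = [strip_personal_fields(row) for row in raw_rows]
print(training_rows)
```

Removing direct identifiers is only a first step, because combinations of the remaining fields can sometimes still point to an individual.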

      Finally, big data is so large that humans can’t reasonably visualize it without help. Part of what defines big data as big is that a human can learn something from it, yet the sheer magnitude of the dataset makes recognizing the patterns unaided impossible (or at least impractically slow). Machine learning helps humans make sense of and use big data.

      Before you can use big data for a machine learning application, you need a source of big data. Of course, the first thing that most developers think about is the huge, corporate-owned database, which could contain interesting information, but it’s just one source. The fact of the matter is that your corporate databases might not even contain particularly useful data for a specific need. The following sections describe locations you can use to obtain additional big data.

      Building a new data source

      Obtaining data from public sources

      Governments, universities, nonprofit organizations, and other entities often maintain publicly available databases that you can use alone or combined with other databases to create big data for machine learning. For example, you can combine several Geographic Information Systems (GIS) to help create the big data required to make decisions such as where to put new stores or factories. The machine learning algorithm can take all sorts of information into account — everything from the amount of taxes you have to pay to the elevation of the land (which can contribute to making your store easier to see).
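
Combining public sources usually comes down to joining records that share a key. The sketch below shows one possible approach with plain Python dictionaries. The datasets, the parcel_id key, the field names, and the helper join_on_key are all hypothetical and stand in for whatever the real sources provide.

```python
def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dicts on a shared key field.

    Assumes each key value appears at most once in right_rows.
    """
    right_index = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:
            joined.append({**row, **match})   # merge fields from both sources
    return joined

# Made-up extracts from two hypothetical public sources.
tax_rates = [
    {"parcel_id": "P1", "tax_rate": 0.021},
    {"parcel_id": "P2", "tax_rate": 0.018},
]
elevations = [
    {"parcel_id": "P1", "elevation_m": 310},
    {"parcel_id": "P3", "elevation_m": 95},
]

combined = join_on_key(tax_rates, elevations, "parcel_id")
print(combined)
```

Only parcels present in both sources survive an inner join, so mismatched keys are worth checking for during data cleaning.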

      The best part about using public data is that it’s usually free, even for commercial use (or you pay a nominal fee for it). In addition, many of the organizations that created these sources maintain them in nearly perfect condition because the organization has a mandate, uses the data to attract income, or uses the data internally. When obtaining public source data, you need to consider a number of issues to ensure that you actually get something useful. Here are some of the criteria you should think about when making a decision:

       The cost, if any, of using the data source

       The formatting of the data source

       Access to the data source (which means having the proper infrastructure in place, such as an Internet connection when using Twitter data)

       Permission to use the data source (some data sources are copyrighted)

       Potential issues in cleaning the data to make it useful for machine learning

       Potential security issues in accessing the data, adding it to other data sources, and managing it locally
