Machine Learning For Dummies. John Paul Mueller

       that the data is the original data, rather than data that purports to be original but has been biased or modified in other ways that would change the results of using it

       Determining that the data doesn’t contain personally identifiable information that the data source originator may not have permission to use. (Chapter 22 covers issues like this one.)
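      As a rough illustration of the last check, the following Python sketch scans free-text records for strings that look like email addresses or US-style phone numbers. The patterns, the function name find_possible_pii, and the sample records are all assumptions made for this example; a real review for personally identifiable information needs far broader rules and human judgment.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def find_possible_pii(records):
    """Return (record_index, kind, matched_text) for every suspicious match."""
    findings = []
    for index, text in enumerate(records):
        for kind, pattern in (("email", EMAIL_PATTERN), ("phone", US_PHONE_PATTERN)):
            for match in pattern.finditer(text):
                findings.append((index, kind, match.group()))
    return findings


# Made-up sample records, not taken from any real data source.
sample_records = [
    "Order 1001 shipped on time",
    "Contact jane.doe@example.com about the refund",
    "Call 555-123-4567 before delivery",
]
findings = find_possible_pii(sample_records)
for record_index, kind, matched_text in findings:
    print(f"record {record_index}: possible {kind}: {matched_text}")
```

      A scan like this only flags candidates for a person to review; an empty result does not show that the data is free of personal information.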

      Obtaining data from private sources

      You can obtain data from private organizations such as Amazon (see Open Data, https://aws.amazon.com/opendata/) and Google (see Public Data Explorer, https://www.google.com/publicdata/directory), both of which maintain immense databases that contain all sorts of useful information. Except for publicly shared data sources, you should expect to pay for access to the data in some cases, especially when you use it in a commercial setting. You may not be allowed to download the data to your own servers, and that restriction may affect how you use the data in a machine learning environment. For example, some algorithms work more slowly with data that they must access in small pieces.

      The biggest advantage of using data from a private source is that you can expect better consistency. The data is likely cleaner than data from a public source. In addition, you usually have access to a larger database with a greater variety of data types. Of course, it all depends on where you get the data.

      Creating new data from existing data

      Your existing data may not work well for machine learning scenarios, but that doesn’t keep you from creating a new data source using the old data as a starting point. For example, you might find that you have a customer database that contains all the customer orders, but the data isn’t useful for machine learning because it lacks the tags required to group the data into specific types. One of the new job types that you can expect to create involves people who massage data to make it better suited for machine learning, including by adding specific information types such as tags.
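      To make the tagging idea concrete, here is a minimal Python sketch that builds a new, tagged data set from existing order records. Everything in it is an assumption for illustration: the record layout, the TAG_RULES keyword table, and the helper names tag_order and add_tags. In practice, people who know the data would define the tagging rules.

```python
# Hypothetical keyword-to-tag rules (an assumption for this example).
TAG_RULES = {
    "electronics": ("laptop", "phone", "headphones"),
    "grocery": ("coffee", "bread", "milk"),
}


def tag_order(description):
    """Return the sorted list of tags whose keywords appear in the description."""
    text = description.lower()
    return sorted(
        tag
        for tag, keywords in TAG_RULES.items()
        if any(keyword in text for keyword in keywords)
    )


def add_tags(orders):
    """Build a new list of order records; the original records are not modified."""
    return [{**order, "tags": tag_order(order["description"])} for order in orders]


# Made-up customer orders standing in for an existing database.
orders = [
    {"order_id": 1, "description": "Laptop sleeve and wireless headphones"},
    {"order_id": 2, "description": "Ground coffee, 1 kg"},
    {"order_id": 3, "description": "Garden hose"},
]
tagged_orders = add_tags(orders)
for order in tagged_orders:
    print(order["order_id"], order["tags"])
```

      Orders that match no rule end up with an empty tag list, which marks them as the cases a person still needs to review.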

      

Machine learning will have a significant effect on your business. The article at https://www.computerworld.com/article/3007053/big-data/how-machine-learning-will-affect-your-business.html describes some of the ways in which you can expect machine learning to change how you do business. One of the points in this article is that machine learning typically works on 80 percent of the data. In the other 20 percent of cases, you still need humans to take over the job of deciding just how to react to the data and then act upon it. The point is that machine learning saves money by taking over repetitious tasks that humans don’t really want to do in the first place (which makes humans inefficient at them). However, machine learning doesn’t get rid of the need for humans completely, and it creates the need for new types of jobs that are a bit more interesting than the ones that machine learning has taken over. Also consider that you need more humans at the outset, until the modifications they make train the algorithm to understand what sorts of changes to make to the data.

      Using existing data sources

      Your organization has data hidden in all sorts of places. The problem is in recognizing the data as data. For example, you may have sensors on an assembly line that track how products move through the assembly process and ensure that the assembly line remains efficient. Those same sensors can potentially feed information into a machine learning scenario because they could provide inputs on how product movement affects customer satisfaction or the price you pay for postage. The idea is to discover how to create mashups that present existing data as a new kind of data that lets you do more to make your organization work well.

      

Big data can come from any source, even your email. The article at https://www.semrush.com/blog/deep-learning-an-upcoming-gmail-feature-that-will-answer-your-emails-for-you/ discusses how Google uses your email to create a list of potential responses for new emails. You can read about the process involved for the user at https://www.lifewire.com/how-to-send-canned-replies-automatically-in-gmail-1172080. Instead of having to respond to every email individually, you can simply select a canned response at the bottom of the page. This sort of automation isn’t possible without the original email data source. Looking for big data in specific locations will blind you to the big data sitting in common places that most people don’t think about as data sources. Tomorrow’s applications will rely on these alternative data sources, but to create these applications, you must begin seeing the data hidden in plain view today.

      Some of these applications already exist, and you’re completely unaware of them. The video at https://research.microsoft.com/apps/video/default.aspx?id=256288 makes the presence of these kinds of applications more apparent. By the time you complete the video, you begin to understand that many uses of machine learning are already in place and users already take them for granted (or have no idea that the application is even present). Many developers see the quest toward an ultimate machine learning experience as the master algorithm, which is the topic of a book entitled The Master Algorithm, by Pedro Domingos (https://www.amazon.com/exec/obidos/ASIN/0465094279/datacservip0f-20/).

      Locating test data sources

      In some cases, you might not have enough data at the outset for both training (the essential initial test) and testing. When this happens, you might need to create a test setup to generate more data, rely on data generated in real time, or create the test data source artificially. You can also use similar data from existing sources, such as a public or private database. The point is that you need both training and testing data that will produce a known result before you unleash your algorithm into the real world of working with uncertain data.
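      One way to create a test data source artificially is to generate data from a rule you already know, so that the expected result is known in advance. The Python sketch below does this under stated assumptions: the rule y = 2x + 1, the noise level, the 25 percent test fraction, and all function names are choices made for this example, not a prescribed method.

```python
import random


def make_synthetic_data(n_samples, seed=0):
    """Generate (x, y) pairs from a known rule, y = 2x + 1, plus small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = rng.uniform(0.0, 10.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into training and testing parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


data = make_synthetic_data(200)
train, test = train_test_split(data)
slope, intercept = fit_line(train)

# Mean absolute error on the held-out test data, which the fit never saw.
test_error = sum(abs(slope * x + intercept - y) for x, y in test) / len(test)
print(f"slope={slope:.3f} intercept={intercept:.3f} test_error={test_error:.3f}")
```

      Because the generating rule is known, you can check that the fitted slope and intercept land close to 2 and 1 before trusting the same procedure on real, uncertain data.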

      Some sites online would have you believe that statistics and machine learning are two completely different technologies. For example, when you read Statistics vs. Machine Learning, fight! (http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/),