Title: Data-Intensive Text Processing with MapReduce
Author: Jimmy Lin
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on Human Language Technologies
ISBN: 9781608453436
2.4 Partitioners and Combiners
2.5 The Distributed File System
2.6 Hadoop Cluster Architecture
2.7 Summary
3.1 Local Aggregation
3.1.1 Combiners and In-Mapper Combining
3.1.2 Algorithmic Correctness with Local Aggregation
3.2 Pairs and Stripes
3.3 Computing Relative Frequencies
3.4 Secondary Sorting
3.5 Relational Joins
3.5.1 Reduce-Side Join
3.5.2 Map-Side Join
3.5.3 Memory-Backed Join
3.6 Summary
4 Inverted Indexing for Text Retrieval
4.1 Web Crawling
4.2 Inverted Indexes
4.3 Inverted Indexing: Baseline Implementation
4.4 Inverted Indexing: Revised Implementation
4.5 Index Compression
4.5.1 Byte-Aligned and Word-Aligned Codes
4.5.2 Bit-Aligned Codes
4.5.3 Postings Compression
4.6 What About Retrieval?
4.7 Summary and Additional Readings
5.1 Graph Representations
5.2 Parallel Breadth-First Search
5.3 PageRank
5.4 Issues with Graph Processing
5.5 Summary and Additional Readings
6 EM Algorithms for Text Processing
6.1 Expectation Maximization
6.1.1 Maximum Likelihood Estimation
6.1.2 A Latent Variable Marble Game
6.1.3 MLE with Latent Variables
6.1.4 Expectation Maximization
6.1.5 An EM Example
6.2 Hidden Markov Models
6.2.1 Three Questions for Hidden Markov Models
6.2.2 The Forward Algorithm
6.2.3 The Viterbi Algorithm
6.2.4 Parameter Estimation for HMMs
6.2.5 Forward-Backward Training: Summary
6.3 EM in MapReduce
6.3.1 HMM Training in MapReduce
6.4 Case Study: Word Alignment for Statistical Machine Translation
6.4.1 Statistical Phrase-Based Translation
6.4.2 Brief Digression: Language Modeling with MapReduce
6.4.3 Word Alignment
6.4.4 Experiments
6.5 EM-Like Algorithms
6.5.1 Gradient-Based Optimization and Log-Linear Models
6.6 Summary and Additional Readings
7.1 Limitations of MapReduce
7.2 Alternative Computing Paradigms
7.3 MapReduce and Beyond
Acknowledgments
The first author is grateful to Esther and Kiri for their loving support. He dedicates this book to Joshua and Jacob, the new joys of his life.
The second author would like to thank Herb for putting up with his disorderly living habits and Philip for being a very indulgent linguistics advisor.
This work was made possible by the Google and IBM Academic Cloud Computing Initiative (ACCI) and the National Science Foundation’s Cluster Exploratory (CLuE) program, under award IIS-0836560, and also award IIS-0916043. Any opinions, findings, conclusions, or recommendations expressed in this book are those of the authors and do not necessarily reflect the views of the sponsors.
We are grateful to Jeff Dean, Miles Osborne, Tom White, as well as numerous other individuals who have commented on earlier drafts of this book.
Jimmy Lin and Chris Dyer
May 2010
CHAPTER 1
Introduction
MapReduce [45] is a programming model for expressing distributed computations on massive