This book lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. Representing much of the foundational research in this field, the book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data.
Sep Kamvar focuses on eigenvector-based techniques in Web search, introducing a personalized variant of Google's PageRank algorithm, and he outlines algorithms--such as the now-famous quadratic extrapolation technique--that speed up computation, making personalized PageRank feasible. Kamvar suggests that Power Method-related techniques ultimately should be the basis for improving the PageRank algorithm, and he presents algorithms that exploit the convergence behavior of individual components of the PageRank vector. Kamvar then extends the ideas of reputation management and personalized search to distributed networks like peer-to-peer and social networks. He highlights locality and computational considerations related to the structure of the network, and considers such unique issues as malicious peers. He describes the EigenTrust algorithm and applies various PageRank concepts to P2P settings. Discussion chapters summarizing results conclude the book's two main sections.
Clear and thorough, this book provides an authoritative look at central innovations in search for all of those interested in the subject.
"synopsis" may belong to another edition of this title.
Sep Kamvar is a consulting assistant professor of computational mathematics at Stanford University. From 2003 to 2007, he was the engineering lead for personalization at Google. He is the founder and former CEO of Kaltix, a personalized search engine acquired by Google in 2003.
"The writing style is extremely clear, and the book is accessible to readers both within and outside of the field."--Chen Greif, University of British Columbia
"The clarity of presentation makes this book accessible to a broad audience. The scholarship is thorough and sound, and the experimental results are presented in a precise and detailed fashion."--Taher Haveliwala, QForge Labs
"Kamvar helped establish a foundation for P2P search and this book provides an authoritative record and source for his excellent work in this area."--Andrew Tomkins, Google
Tables.....................................................................ixFigures....................................................................xiAcknowledgments............................................................xvChapter 1 Introduction.....................................................1PART I WORLD WIDE WEB......................................................5Chapter 2 PageRank.........................................................7Chapter 3 The Second Eigenvalue of the Google Matrix.......................15Chapter 4 The Condition Number of the PageRank Problem.....................20Chapter 5 Extrapolation Algorithms.........................................23Chapter 6 Adaptive PageRank................................................42Chapter 7 BlockRank........................................................51PART II P2P NETWORKS.......................................................73Chapter 8 Query-Cycle Simulator............................................75Chapter 9 EigenTrust.......................................................84Chapter 10 Adaptive P2P Topologies.........................................108Chapter 11 Conclusion......................................................133Bibliography...............................................................135
Distributed, self-organizing networks such as the World Wide Web and peer-to-peer networks allow for fast access to vast quantities of diverse information for a large number of users. However, with such large scale and data diversity comes the challenge of finding relevant data from reputable sources in an efficient manner.
This book, addresses the issues of relevance and reputation by exploiting user preference information to perform reputation management and personalized search. The issues of personalization and reputation management are highly intertwined, in terms of both the basic ideas and the underlying technologies. Personalization exploits the preferences of an individual to bias search toward that individual's preferences, while reputation management aggregates the preferences of all individuals to bias search toward the data sources that are deemed reputable by the group.
The ideas of reputation and personalization are powerful in conjunction. For example, a personalized Web search for the term "giants" would return the official site of the New York Giants to a football fan from New York, while the same query would return the official site of the San Francisco Giants to a baseball fan from San Francisco. Personalization takes advantage of the local context to return the right sports team, and reputation takes advantage of the global context to return the official site of the corresponding team, rather than some random fan page.
In large-scale diverse data networks, a query will often have so many results that the challenge lies in finding those that are most relevant and reputable. When traditional IR keyword matching techniques are combined with the dual techniques of personalization and reputation management, the end user is likely to have to spend less time intelligently formulating a query and filtering through irrelevant data.
This book is written in two parts, the first part focusing on the Web, and the second on peer-to-peer networks. These parts, while they share the themes of reputation and personalization, differ in style as well as application area. Part I has more mathematical proofs, while Part II relies more on simulation and experimentation. You may read them independently, but together they will give a broader view of the world. Both Part I and Part II, however, are meant for an audience comfortable with advanced concepts in math and computer science.
1.1 WORLD WIDE WEB
Google's PageRank algorithm [56] revolutionized Web search by providing a reliable, spam-resistant way to find reputable web pages. The algorithm is based on the idea that a link from page i to page j confers authority on page j. Therefore, pages with many links from reputable pages are themselves reputable. Part I of this book addresses the issue of personalizing the PageRank algorithm for individual users.
The PageRank algorithm involves the computation of the dominant eigenvector of a Markov matrix describing the behavior of a model Web surfer jumping from page to page on the Web hyperlink graph. Chapter 2 reviews the PageRank algorithm and the random surfer model. Chapters 3 and 4 introduce some mathematical properties of PageRank that guide how we proceed in algorithm design.
It has been suggested that, by biasing the behavior of the model surfer to reflect the biases of a given user, PageRank can be personalized for each individual user [56]. However, due to the sheer size of the web matrix, doing an individual eigenvector computation for each user is prohibitively expensive, and a computationally tractable algorithm for Personalized PageRank has remained an open problem since it was suggested in 1998.
Chapters 5 through 7 discuss techniques for accelerating PageRank in order make the idea of Personalized PageRank computationally tractable. Personalized PageRank is presented at the end of Chapter 7.
Much of the content in Part I represents joint work with Taher Haveliwala, Glen Jeh, Chris Manning, and Gene Golub.
1.2 P2P NETWORKS
Part II addresses the idea of reputation and personalization in the context of file-sharing peer-to-peer networks. Due to the highly distributed nature of P2P networks, the technical challenges here are different from those described for Personalized PageRank. The first challenge is to devise an algorithm that computes and stores reputation in a distributed manner with minimal overhead and that is resistant to malicious users. Chapter 9 describes the EigenTrust algorithm for reputation management in P2P systems. Since queries in a large-scale P2P network have a limited time horizon, personalizing P2P search can be achieved by designing the topology of a P2P network such that each peer is surrounded by peers that are likely to store data of interest to that peer. In Chapter 10, a peer-level protocol is presented for the self-organization of such a P2P topology. These protocols are tested using a P2P simulator called the Query-Cycle simulator, described in Chapter 8.
Much of the content in Part II represents joint work with Mario Schlosser, Tyson Condie, and Hector Garcia-Molina.
1.3 CONTRIBUTIONS
The work presented in this book offers three main contributions to research in information retrieval. The first is a mathematical analysis of PageRank, including convergence and stability guarantees. While convergence and stability have been observed empirically for PageRank, domain-independent guarantees are useful when proposing PageRank-like algorithms in other problem domains. Furthermore, convergence and stability analysis generally lays a foundation for future work in numerical algorithms. In this case, the convergence analysis of PageRank suggests that future algorithms should be based on the Power Method.
The second main contribution of this work is the presentation of a scalable, personalized PageRank algorithm for Web search. In particular, we use properties of the problem and the domain to speed up the PageRank algorithm. The properties of the matrix (sparsity and large eigengap) lead us to use algorithms based on the Power Method throughout the book, and the extrapolation algorithms specifically exploit the matrix properties. The domain properties of the Web as a hierarchical dynamic system lead us to the Adaptive PageRank and BlockRank algorithms. And finally, the linearity of PageRank, another property of the problem, allows us to use all these algorithms in conjunction with Topic-Sensitive PageRank. The scalability issues have long been a bottleneck for the successful deployment of personalized search on the scale of the web, and this book addresses those issues.
The third main contribution is bringing the ideas of reputation and personalization to search in P2P networks. As the quantity and diversity of data on P2P networks approaches that of the web, the importance of search quality in P2P networks becomes increasingly important. The recent focus of research in P2P search has been efficiency for point queries (exact-match queries). However, while efficiency for point queries is important, point queries represent only a small fraction of possible queries in today's P2P networks. Three main ideas are presented within this contribution. The first is the understanding that the ideas behind PageRank can also be applied to search in P2P networks. The second is a method of computing the dominant eigenvector in a highly distributed and potentially subversive environment. These are the basis of the EigenTrust algorithm. The third is a recognition that the local neighborhood is more important than a differential quality score for personalization in P2P search, where queries are only broadcast across a limited time horizon. This is the basis of Adaptive P2P Topologies.
2.1 PAGERANK BASICS
The PageRank algorithm for determining the reputation of Web pages has become a central technique in Web search [56]. The core of the PageRank algorithm involves computing the principal eigenvector of the Markov matrix representing the hyperlink structure of the Web. As the Web graph is very large, containing several billion nodes, the PageRank vector is generally computed offline, during the preprocessing of the Web crawl, before any queries have been issued. As discussed in Chapter 1, personalization requires significant advances to the standard PageRank algorithm.
This chapter reviews the standard PageRank algorithm [56] and some of the mathematical tools that will be used in analyzing and improving the standard iterative algorithm for computing PageRank throughout the rest of this book.
Underlying PageRank is the following basic assumption. A link from a Web page u to another page v can be viewed as evidence that v is an "important" page. In particular, the amount of importance conferred on v by u is proportional to the importance of u and inversely proportional to the number of pages u points to. Since the importance of u is itself not known, determining the importance for every page i in the Web requires an iterative fixed-point computation.
To allow for a more rigorous analysis of the necessary computation, we next describe an equivalent formulation in terms of a random walk on the directed Web graph G. (The graph G is the directed graph where each node represents a page on the Web, and an edge between nodes u and v represents a link from page u to page v.) Let u [right arrow] v denote the existence of an edge from u to v in G. Let deg(u) be the outdegree of page u in G. Consider a random surfer visiting page u at time k. In the next time step, the surfer chooses a node [v.sub.i] from among u's out-neighbors {v|u [right arrow] v} uniformly at random. In other words, at time k + 1, the surfer lands at node [v.sub.i] [member of] {v|u [right arrow] v} with probability 1/deg(u).
The PageRank of a page i is defined as the probability that at some particular time step k > K, the surfer is at page i. In the limit as K [right arrow] [infinity], this probability distribution is called the stationary probability distribution.
With minor modifications to the random walk, this probability distribution is unique, illustrated as follows. Consider the Markov chain induced by the random walk on G, where the states are given by the nodes in G and the stochastic transition matrix describing the transition from i to j is given by Q with [Q.sub.ij] = 1/deg(i).
For Q to be a valid transition probability matrix, every node must have at least 1 outgoing transition; that is, Q should have no rows consisting of all zeros. This holds if G does not have any pages with outdegree 0, which is not the case for the Web graph. Q can be converted into a valid transition matrix by adding a complete set of outgoing transitions to pages with outdegree 0. In other words, we can define the new matrix P where all states have at least one outgoing transition in the following way. Let n be the number of nodes (pages) in the Web graph. Let [??] be the n-dimensional column vector representing a uniform probability distribution over all nodes:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.1)
Let [??] be the n-dimensional column vector identifying the nodes with outdegree 0:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Then we construct P' as follows:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
In terms of the random walk, the effect of D is to modify the transition probabilities so that a surfer visiting a dangling page (a page with no outlinks) randomly jumps to another page in the next time step, using the distribution given by [??].
By the ergodic theorem for Markov chains [31], the Markov chain defined by P has a unique stationary probability distribution if P is aperiodic and irreducible. In general, neither of these properties holds for the Markov chain induced by the Web graph.
In the context of computing PageRank, the standard way of ensuring these properties is to add a new set of complete outgoing transitions, with small transition probabilities, to all nodes, creating a strongly connected (and thus irreducible) transition graph. Furthermore, adding this set of outgoing transitions will ensure that the transition graph has at least one self-loop, thus guaranteeing aperiodicity. For a more in-depth discussion of these conditions, see [31].
In matrix notation, we construct the aperiodic, irreducible Markov matrix P' as follows:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
In terms of the random walk, the effect of E is as follows. At each time step, with probability (1 - c), a surfer visiting any node will jump to a random Web page (rather than following an outlink). The destination of the random jump is chosen according to the probability distribution given in [??]. Artificial jumps taken because of E are referred to as teleportation.
By redefining the vector [??] given in (2.1) to be nonuniform, so that D and E add artificial transitions with nonuniform probabilities, the resultant PageRank vector can be biased to prefer certain kinds of pages. For this reason, we refer to [??] as the personalization vector.
For simplicity and consistency with prior work, the remainder of the discussion will be in terms of the transpose matrix, A = [(P').sup.T]; that is, the transition probability distribution for a surfer at node i is given by row i of P' and column i of A.
Note that the edges artificially introduced by D and E never need to be explicitly materialized, so this construction has no impact on efficiency or the sparsity of the matrices used in the computations. In particular, the matrix-vector multiplication [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] can be implemented efficiently using Algorithm 1.
Assuming that the probability distribution over the surfer's location at time 0 is given by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], the probability distribution for the surfer's location at time k is given by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. The unique stationary distribution of the Markov chain is defined as [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which is equivalent to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and is independent of the initial distribution [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. This is simply the principal eigenvector of the matrix A = [(P").sup.T] [31], which is exactly the PageRank vector we would like to compute.
The standard PageRank algorithm computes the principal eigenvector by starting with the uniform distribution [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and computing successive iterates [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] until convergence. This is known as the Power Method, and is discussed in further detail in Section 2.3. The next section summarizes some of the notation described in the section, and introduces some new notation that will be used later on in this book.
2.2 NOTATION AND MATHEMATICAL PRELIMINARIES
We will use the following notation and simple mathematical preliminaries throughout this book.
(Continues...)
Excerpted from Numerical Algorithms for Personalized Search in Self-organizing Information Networksby Sep Kamvar Copyright © 2010 by Princeton University Press. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
"About this title" may belong to another edition of this title.
Seller: HPB-Red, Dallas, TX, U.S.A.
Hardcover. Condition: Good. Connecting readers with great books since 1972! Used textbooks may not include companion materials such as access codes, etc. May have some wear or writing/highlighting. We ship orders daily and Customer Service is our top priority! Seller Inventory # S_399509212
Seller: The Book Bin, Salem, OR, U.S.A.
Hardcover. Condition: Very Good. First Edition. First Edition with number line to 1. Hardcover boards very good with light shelf wear. Spine square. Binding sound. Pages crisp and clean. Interior unmarked. Seller Inventory # CORV-BBC-X000385
Seller: Labyrinth Books, Princeton, NJ, U.S.A.
Condition: New. Seller Inventory # 126103
Seller: GreatBookPrices, Columbia, MD, U.S.A.
Condition: As New. Unread book in perfect condition. Seller Inventory # 6343192
Seller: Kennys Bookshop and Art Galleries Ltd., Galway, GY, Ireland
Condition: New. Lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. This book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data. Num Pages: 160 pages, 55 line illus. 11 tables. BIC Classification: UY. Category: (P) Professional & Vocational; (U) Tertiary Education (US: College). Dimension: 165 x 241 x 16. Weight in Grams: 352. . 2010. Hardcover. . . . . Seller Inventory # V9780691145037
Seller: GreatBookPrices, Columbia, MD, U.S.A.
Condition: New. Seller Inventory # 6343192-n
Seller: Rarewaves USA, OSWEGO, IL, U.S.A.
Hardback. Condition: New. This book lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. Representing much of the foundational research in this field, the book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data. Sep Kamvar focuses on eigenvector-based techniques in Web search, introducing a personalized variant of Google's PageRank algorithm, and he outlines algorithms--such as the now-famous quadratic extrapolation technique--that speed up computation, making personalized PageRank feasible. Kamvar suggests that Power Method-related techniques ultimately should be the basis for improving the PageRank algorithm, and he presents algorithms that exploit the convergence behavior of individual components of the PageRank vector. Kamvar then extends the ideas of reputation management and personalized search to distributed networks like peer-to-peer and social networks.He highlights locality and computational considerations related to the structure of the network, and considers such unique issues as malicious peers. He describes the EigenTrust algorithm and applies various PageRank concepts to P2P settings. Discussion chapters summarizing results conclude the book's two main sections. Clear and thorough, this book provides an authoritative look at central innovations in search for all of those interested in the subject. Seller Inventory # LU-9780691145037
Seller: GreatBookPricesUK, Woodford Green, United Kingdom
Condition: New. Seller Inventory # 6343192-n
Quantity: Over 20 available
Seller: GreatBookPricesUK, Woodford Green, United Kingdom
Condition: As New. Unread book in perfect condition. Seller Inventory # 6343192
Quantity: Over 20 available
Seller: Kennys Bookstore, Olney, MD, U.S.A.
Condition: New. Lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. This book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data. Num Pages: 160 pages, 55 line illus. 11 tables. BIC Classification: UY. Category: (P) Professional & Vocational; (U) Tertiary Education (US: College). Dimension: 165 x 241 x 16. Weight in Grams: 352. . 2010. Hardcover. . . . . Books ship from the US and Ireland. Seller Inventory # V9780691145037