Big Data With Pyspark: Processing Large Datasets: A Hands-On Guide To Distributed Data Engineering, Machine Learning And Big Data Pipelines With Apache Spark And Python - Softcover

Synopsis

You'll Learn

  • Understand the Foundations of Big Data and Distributed Computing: Gain a solid grasp of Big Data concepts, including the 5 Vs, the challenges of traditional systems, and the fundamental principles of distributed computing like parallelism, fault tolerance, and scalability.

  • Master the PySpark Ecosystem: Learn the architecture of Apache Spark, its core components (Spark SQL, Structured Streaming, MLlib, GraphFrames), and how the PySpark API seamlessly integrates with Python.

  • Set Up Your PySpark Environment: Get hands-on experience setting up a complete development environment on your local machine and learn how to run applications on various cloud platforms like Databricks, AWS EMR, and Google Cloud Dataproc (see the local-setup sketch after this list).

  • Process Data with RDDs and DataFrames: Master Spark's core data structures, from the low-level RDDs to the powerful and optimized DataFrames. Learn to apply a wide range of transformations and actions for data manipulation (see the DataFrame sketch after this list).

  • Perform Advanced Data Wrangling and Feature Engineering: Acquire skills in data cleaning, handling missing values and duplicates, and performing complex transformations using Spark SQL, Window Functions, and User-Defined Functions (UDFs), including high-performance Pandas UDFs (see the window-function and Pandas UDF sketch after this list).

  • Connect to Diverse Data Sources: Read and write data from various formats (CSV, JSON, Parquet) and connect to external systems like relational databases (JDBC), NoSQL stores (Cassandra, MongoDB), and cloud storage (S3, ADLS); see the reader/writer sketch after this list.

  • Build Real-Time Data Pipelines: Implement modern, fault-tolerant data ingestion with Structured Streaming, including event-time handling, watermarking, and stateful transformations for real-time analytics (see the streaming sketch after this list).

  • Apply Machine Learning at Scale with MLlib: Learn to build and evaluate distributed machine learning pipelines for classification, regression, and clustering tasks using Spark's MLlib library (see the MLlib pipeline sketch after this list).

  • Analyze Graph-Structured Data: Explore the power of GraphFrames to model and analyze complex relationships, run graph algorithms like PageRank, and find patterns in network data (see the GraphFrames sketch after this list).

  • Optimize PySpark Applications for Performance: Dive deep into performance tuning, including understanding DAGs and shuffles, managing partitioning, optimizing joins, and configuring memory settings to make your code run faster and more efficiently (see the tuning sketch after this list).

  • Monitor, Debug, and Deploy Applications: Utilize the Spark UI to monitor your jobs, troubleshoot common errors, and learn to package and deploy your PySpark applications to different cluster managers like YARN and Kubernetes.

  • Solve Real-World Big Data Problems: Apply your knowledge through practical case studies, including building a recommendation engine, a real-time fraud detection system, and an ETL pipeline, to solidify your skills and build a portfolio.
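
The short sketches below illustrate a few of the topics above; every application name, path, column, and connection string in them is a placeholder, not an example taken from the book. First, a minimal local setup: PySpark installed with pip and a SparkSession created against the bundled local master.

    # pip install pyspark
    from pyspark.sql import SparkSession

    # A local SparkSession that uses every core on this machine.
    spark = (
        SparkSession.builder
        .appName("local-sandbox")   # placeholder application name
        .master("local[*]")         # run Spark in-process, no cluster needed
        .getOrCreate()
    )
    print(spark.version)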
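
A minimal DataFrame sketch: a tiny in-memory dataset, a chain of lazy transformations, and an action that triggers execution. The columns and rows are invented.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # In practice this DataFrame would come from a file, table, or stream.
    df = spark.createDataFrame(
        [("alice", 34, "uk"), ("bob", 29, "us"), ("carol", 41, "uk")],
        ["name", "age", "country"],
    )

    # Transformations are lazy; Spark only builds a plan here.
    adults_by_country = (
        df.filter(F.col("age") >= 30)
          .groupBy("country")
          .agg(F.count("*").alias("n"), F.avg("age").alias("avg_age"))
    )

    # show() is an action: it runs the plan and prints the result.
    adults_by_country.show()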
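
A data-wrangling sketch combining a window function with a vectorised Pandas UDF; the device readings and the Fahrenheit conversion are illustrative only.

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("wrangling-demo").getOrCreate()

    df = spark.createDataFrame(
        [("a", "2024-01-01", 10.0), ("a", "2024-01-02", 12.0), ("b", "2024-01-01", 7.0)],
        ["device", "day", "reading"],
    )

    # Window function: rank readings within each device, highest first.
    w = Window.partitionBy("device").orderBy(F.col("reading").desc())
    ranked = df.withColumn("rank", F.row_number().over(w))

    # Pandas UDF: runs on pandas Series batches instead of row-by-row Python.
    @pandas_udf("double")
    def fahrenheit(c: pd.Series) -> pd.Series:
        return c * 9.0 / 5.0 + 32.0

    ranked.withColumn("reading_f", fahrenheit("reading")).show()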
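
A reader/writer sketch for a few common sources; every path, bucket, table, and JDBC credential below is a placeholder, and the PostgreSQL JDBC driver would need to be on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("io-demo").getOrCreate()

    # CSV with a header row, letting Spark infer column types.
    events = spark.read.csv("data/events.csv", header=True, inferSchema=True)

    # Line-delimited JSON and Parquet use the same reader pattern.
    users = spark.read.json("data/users.json")
    sales = spark.read.parquet("s3a://example-bucket/sales/")

    # Relational data over JDBC.
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/shop")
        .option("dbtable", "public.orders")
        .option("user", "reader")
        .option("password", "secret")
        .load()
    )

    # Write back out as Parquet, partitioned by a (placeholder) date column.
    events.write.mode("overwrite").partitionBy("event_date").parquet("out/events_parquet")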
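
A Structured Streaming sketch with event-time windows and a watermark; the Kafka brokers, topic, and parsing are placeholders, and the Spark-Kafka connector package would need to be supplied at launch.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Read an unbounded stream of events from Kafka.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "clicks")
        .load()
    )

    # Treat the message value as a page name and count clicks per 5-minute
    # event-time window, accepting events that arrive up to 10 minutes late.
    clicks = raw.select(
        F.col("value").cast("string").alias("page"),
        F.col("timestamp").alias("event_time"),
    )
    counts = (
        clicks.withWatermark("event_time", "10 minutes")
              .groupBy(F.window("event_time", "5 minutes"), "page")
              .count()
    )

    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()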
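
An MLlib pipeline sketch: assemble features, fit a logistic-regression pipeline, and score it. The four rows are toy data, and the model is evaluated on its own training data purely to keep the example short.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 7.0, 1.0), (6.0, 8.0, 1.0)],
        ["f1", "f2", "label"],
    )

    # Stage 1 turns raw columns into a feature vector; stage 2 fits the model.
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label"),
    ])
    model = pipeline.fit(df)

    # Area under the ROC curve for the fitted model.
    preds = model.transform(df)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(preds)
    print(f"AUC = {auc:.3f}")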
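
A GraphFrames sketch running PageRank on a toy follow graph; GraphFrames is a separate Spark package that has to be added to the session (for example via spark.jars.packages), and the vertices and edges are invented.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # requires the graphframes package

    spark = SparkSession.builder.appName("graph-demo").getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
    )
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
        ["src", "dst", "relationship"],
    )

    g = GraphFrame(vertices, edges)

    # PageRank scores every vertex by how "central" it is in the graph.
    ranks = g.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("id", "pagerank").show()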
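
A tuning sketch touching three of the levers mentioned above: the shuffle-partition count, repartitioning by the join key, and broadcasting a small dimension table so the join avoids shuffling the large side. Paths, column names, and the partition count are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.appName("tuning-demo")
        # Fewer shuffle partitions than the default 200 for a modest dataset.
        .config("spark.sql.shuffle.partitions", "64")
        .getOrCreate()
    )

    facts = spark.read.parquet("out/sales_parquet")   # large fact table (placeholder)
    dims = spark.read.parquet("out/dim_products")     # small lookup table (placeholder)

    # Co-locate rows that share a join key before the join.
    facts = facts.repartition(64, "product_id")

    # Broadcast the small table so the join runs map-side, without a shuffle.
    joined = facts.join(F.broadcast(dims), on="product_id", how="left")

    # explain() prints the physical plan (look for BroadcastHashJoin).
    joined.explain()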

"synopsis" may belong to another edition of this title.


PythQuill Publishing
Published by Independently published, 2025
ISBN 13: 9798290030715
New Softcover
Print on Demand

Seller: California Books, Miami, FL, U.S.A.

Seller rating: 5 out of 5 stars

Condition: New. Print on Demand. Seller Inventory # I-9798290030715


Buy New: £ 17.35
Shipping: £ 7.33 from U.S.A. to United Kingdom

Quantity: Over 20 available
