- Spark Documentation
- Databricks Spark Knowledge Base
- Spark Programming Guide
- advanced dependency management
- Custom API Examples For Apache Spark - The examples are basic and only for newbies in Scala and Spark
- Welcome to Spark Python API Docs!
- github.com/apache/spark
- SparkTutorials.net - Apache Spark For the Common * Man!
- sparkjava.com/tutorials
- learn hadoop spark by examples
- Running Spark Korean flintrock, pyspark, aws s3, spark sql, jupyter, hadoop, yarn, tuning
- Getting Started with Spark (useful site links)
- Learning Spark With Scala
- Learning Big Data with Python: Apache Spark Introduction - YouTube
- Apache Spark Scala Tutorial For Korean
- Apache Spark Tutorial 2018 | Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training
- Big Data and Hadoop Tutorial For Beginners | Hadoop Spark Tutorial For Beginners
- Apache Spark Tutorial
- Apache Spark Tutorials
- Apache Spark 101
- Learn Apache Spark (Databricks) - Step by Step Guide | LinkedIn
- Spark Internals
- Introduction to Spark Internals
- Start Your Journey with Apache Spark — Part 1
- Start Your Journey with Apache Spark — Part 2
- Start your Journey with Apache Spark — Part 3
- Getting started with Spark & batch processing frameworks | by Hoa Nguyen | Insight
- Spark Internal
- 52. Apache Spark Internal architecture jobs stages and tasks || Spark Cluster Architecture Explained - YouTube
- pubdata.tistory.com/category/Lecture_SPARK
- Apache Spark - Executive Summary
- Teach yourself Apache Spark – Guide for nerds!
- Apache Spark - cyber.dbguide.net
- Stanford CS347 Guest Lecture: Apache Spark
- BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark
- bigdatauniversity.com
- Apache Spark Full Course | Spark Tutorial For Beginners | Complete Spark Tutorial | Simplilearn - YouTube
- Top 5 Online Courses to Learn Apache Spark in 2022 - Best of Lot
- Introduction to Spark
- Introduction to Apache Spark with Scala
- Python and Bigdata - An Introduction to Spark (PySpark)
- Introduction to Apache Spark - Knoldus Blogs
- Spark Programming
- Intro to Apache Spark Training - Part 1
- Cloudera
- Cloudera Engineering Blog · Spark Posts
- How-to: Tune Your Apache Spark Jobs (Part 1)
- How-to: Tune Your Apache Spark Jobs (Part 2)
- LSA-ing Wikipedia with Apache Spark
- Making Apache Spark Testing Easy with Spark Testing Base
- Getting Apache Spark Customers to Production
- Why Your Apache Spark Job is Failing
- How to use Apache Spark with CDP Operational Database Experience - Cloudera Blog
- The Apache Spark @youtube
- Introduction to Apache Spark and Hands-On Practice
- Introduction to Spark, Part 1
- Introduction to Spark, Part 2
- RE: ShootingStar TV Episode 1 - Apache Spark and RDD
- databricks
- sparkhub.databricks.com
- Examples for Learning Spark
- Project Tungsten: Bringing Spark Closer to Bare Metal
- Simplifying Big Data Analytics with Apache Spark
- Databricks Announces General Availability of Its Cloud Platform
- A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
- DEVOPS ADVANCED CLASS
- Notes on Spark usage environments - Databricks
- Data Ingestion using COPY INTO - YouTube
- Data Ingestion using Upload Data UI - YouTube
- Data Ingestion using Auto Loader - YouTube
- dbt Projects Integration in Databricks Workflows - YouTube
- databricks community edition Hands-On Training for Data Science and Machine Learning - YouTube
- What is shuffle read & shuffle write in Apache Spark
- Spark Shuffle Partitions and Optimization – tech.kakao.com
- Scrap your MapReduce! (Or, Introduction to Apache Spark)
- Learning Spark
- Introduction to Data Science with Apache Spark
- HPC is dying, and MPI is killing it
- Why Is Spark Becoming So Popular?
- Analytics With Apache Spark Is Coming
- Interactive Analytics using Apache Spark
- bicdata
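- Spark makes advanced analytics a 'reality' -> machine learning algorithms are included, but from an advanced analyst's point of view only basic algorithms are covered
- Spark makes everything easier -> M/R-style programs become much easier to write; MPI-style programming is not supported
- Spark speaks more than one language -> Scala, Java, and Python are supported, but it is optimized for Scala and the other languages are somewhat less convenient
- Spark delivers results faster -> in performance tests, Spark Streaming is slower than Storm and Spark SQL is slower than Hive; ordinary Spark programs perform well
- Spark is Hadoop-vendor agnostic -> most open source software is vendor agnostic; the use cases, strengths, and weaknesses differ
- Real-time advanced analytics -> faster advanced analytics(??) than the existing (Hadoop) approach, but near real-time rather than real-time
- Why VCNC Chose Spark over Hadoop
- LINE Plus Game Security Development Office... processing 15 TB every 10 minutes with Spark + Mesos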
- bcho.tistory.com/tag/Apache Spark
- Spark Notes
- Why Is Apache Spark Popular?
- Installing Apache Spark
- Introduction to Apache Spark - The Spark Stack
- Apache Spark Cluster Architecture
- Apache Spark - Understanding RDDs (Resilient Distributed Datasets) - #1/2
- Understanding Apache Spark RDDs #2 - Passing Functions to Spark
- Apache Spark - RDD Persistence (Storage Options)
- Apache Spark - Key/Value Pairs (Pair RDD)
- Apache Spark - Python vs Scala Performance Comparison
- blog.madhukaraphatak.com
- Spark Summit
- Using Cascading to Build Data-centric Applications on Spark
- spark-summit.org/2015
- spark-summit.org/east-2016/schedule
- spark-summit.org/2016/schedule
- Spark Summit 2016 West Training
- Spark Summit Europe 2016 Attendance Report
- OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Larisa Sawyer)
- Just Enough Scala for Spark (Dean Wampler)
- TensorFrames: Deep Learning with TensorFlow on Apache Spark (Tim Hunter)
- SPARK SUMMIT EAST 2017
- SPARK SUMMIT 2017 DATA SCIENCE AND ENGINEERING AT SCALE
- The Between Data Team at Spark Summit EU 2017: Trip Report
- 2018-spark-summit-ai-keynotes-2
- Netflix at Spark+AI Summit 2018
- Upgrading Mesos (0.18 -> 0.22.rc) for Spark (1.2.1 -> 1.3.1)
- RDDS ARE THE NEW BYTECODE OF APACHE SPARK
- Spark RDD Operations-Transformation & Action with Example
- Microbenchmarking Big Data Solutions on the JVM – Part 1
- Large-Scale Security Data Analysis with Spark, Mesos, Zeppelin, and HDFS
- (Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
- IBM Donates Machine Learning Technology to the Open Source Community
- Productionizing Spark and the Spark Job Server
- is Hadoop dead and is it time to move to Spark
- Building a Data Analysis System with Spark + S3 + R3 by VCNC
- Parallel Programming with Spark (Part 1 & 2) - Matei Zaharia
- 3 Methods for Parallelization in Spark
- Stream All the Things! Architectures for Data Sets that Never End
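- Building streaming-centric applications and data platforms
- Motivates event-sourcing architectures by showing how simply services can be connected together
- Discusses the trade-offs of various systems (Akka, Spark, Flink, and others) for real-time and analytics use cases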
- Petabyte-Scale Text Processing with Spark
- Combining Druid and Spark: Interactive and Flexible Analytics at Scale
- Interactive Audience Analytics With Spark and HyperLogLog
- Apache Spark Creator Matei Zaharia Interview
- New Developments in Spark
- Spark and Hadoop, a Perfect Combination (in Korean)
- Spark Architecture: Shuffle
- Deep-dive into Spark Internals & Architecture
- Naytev Wants To Bring A Buzzfeed-Style Social Tool To Every Publisher With Spark
- Spinning up a Spark Cluster on Spot Instances: Step by Step
- Spark Meetup at Uber
- Bay Area Apache Spark Meetup @ Intel
- Can Apache Spark process 100 terabytes of data in interactive mode?
- Experience Integrating Apache Spark into the Netflix Big Data Platform
- Data Engineering at Netflix using Apache Spark and Flink with Joan Goyeau - YouTube
- Succinct Spark from AMPLab: Queries on Compressed RDDs
- How-to: Build a Complex Event Processing App on Apache Spark and Drools
- Tuning Spark
- Tuning Java Garbage Collection for Spark Applications
- Improving Spark application performance
- Spark performance tuning eng
- Spark performance tuning Part#2: Parallelism
- Spark performance tuning from the trenches
- Spark tuning for Enterprise System Administrators
- Tuning Spark Configuration : Naver Blog
- “Fast food” and tips for RDD
- ScalaML - Machine Learning Basics with Scala (+Spark)
- Secondary Sorting in Spark
- Distributed computing with spark
- Comparing the Dataflow/Beam and Spark Programming Models
- Apache Spark Architecture
- Scala vs. Python for Apache Spark
- Natural Language Processing With Apache Spark
- MapR Releases Its 'Apache Spark' Training Course for Free
- Spark HDFS Integration
- spark textfile load file instead of lines
- Reading Text Files by Lines
- Evening w/ Martin Odersky! (Scala in 2016) +Spark Approximates +Twitter Algebird
- ScalaJVMBigData-SparkLessons.pdf
- Introduction to Spark 2.0 : A Sneak Peek At Next Generation Spark
- Spark Release 2.0.0
- Spark SQL, DataFrames and Datasets Guide
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - When to use them and why
- Introducing Apache Spark 2.0
- Spark 2.0 Technical Preview: Easier, Faster, and Smarter
- Apache Spark 2.0 presented by Databricks co-founder Reynold Xin
- APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
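- DataFrames and Datasets are more than twice as fast as RDDs and use less than a quarter of the memory
- Use DataFrames or Datasets instead of RDDs unless the data simply cannot be handled with them (for example, you need a transformation the built-in API does not provide, or the data cannot be structured, such as news article bodies)
- RDD: high degree of freedom (programming)
- DataFrame: lower degree of freedom (SQL-like), but in exchange many aspects can be optimized, including data storage, parallelization, memory usage, and execution plans for complex queries, and much of that optimization has already been done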
- Spark 2.0 – Datasets and case classes
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- Generating Flame Graphs for Apache Spark
- Apache Spark 2.0 Tuning Guide
- Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data
- Modern Spark DataFrame & Dataset | Apache Spark 2.0 Tutorial
- Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
- Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
- Spark 2.0 - by Matei Zaharia
- Spark 2.x Troubleshooting Guide
- Introducing Apache Spark 2.1 Now available on Databricks
- What's New in the Upcoming Apache Spark 2.3 Release?
- Introducing Stream-Stream Joins in Apache Spark 2.3
- ORC improvement in Apache Spark 2.3
- The easiest way to run Spark in production
- Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
- Spark Takes On Dataflow in Benchmark Test
- Stock inference engine using Spring XD, Apache Geode / GemFire and Spark ML Lib. http://pivotal-open-source-hub.github.io/StockInference-Spark
- Learning Spark - People Who Dream of Becoming Architects
- Tutorial: Spark-GPU Cluster Dev in a Notebook A tutorial on ad-hoc, distributed GPU development on any Macbook Pro
- GPU Acceleration on Apache Spark™
- Why Should You Use GPUs with Spark?
- Cluster - spark
- Example of Remote I/O against a Cloudera Hadoop Cluster with Spark
- Remote I/O against a Hadoop Cluster Using Spark
- How we reduced our Apache Spark cluster cost using best practices
- Building a Spark Cluster. Hello, I'm Amber, a data engineer on the Common Platform Development Team at Yeogi Eottae Company. | by Amber Bae | Mar, 2024 | Yeogi Eottae Tech Blog
- Best Practices for Using Apache Spark on AWS
- Apache Spark Key Terms, Explained
- How Not to Write Your Code
- Working effectively with Apache Spark on AWS - Singapore Apache Spark+AI Meetup
- How to export millions of records from Mysql to AWS S3?
- Build a Prediction Engine Using Spark, Kudu, and Impala
- Deep Dive: Apache Spark Memory Management
- Apache Spark Memory Management: Deep Dive | LinkedIn
- A Developer’s View into Spark's Memory Model - Wenchen Fan
- option
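- spark.executor.cores: number of cores per executor (set here to a node's core count)
- spark.cores.max: total number of cores the application may use across the cluster
- e.g.
- With 2 worker nodes of 8 CPU cores each, setting spark.cores.max to 8 means only 1 node does any work
- To run on both nodes, set spark.cores.max to 16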
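
The arithmetic behind that example can be sketched in plain Scala. This is a rough illustration only, and it assumes each executor takes exactly `spark.executor.cores` cores, so an executor sized to a whole 8-core node occupies one node. `CoresSketch` and `executorCount` are hypothetical names, not part of Spark.

```scala
object CoresSketch {
  // Hypothetical helper (not a Spark API): number of executors that fit under
  // the application-wide core cap when every executor takes a fixed core count.
  def executorCount(coresMax: Int, executorCores: Int): Int = {
    require(executorCores > 0, "executorCores must be positive")
    coresMax / executorCores // integer division: partial executors are not launched
  }

  def main(args: Array[String]): Unit = {
    val executorCores = 8 // spark.executor.cores, equal to one node's core count

    // spark.cores.max = 8: room for a single 8-core executor, so one node works
    println(executorCount(coresMax = 8, executorCores = executorCores))

    // spark.cores.max = 16: room for two 8-core executors, one per node
    println(executorCount(coresMax = 16, executorCores = executorCores))
  }
}
```

Under these assumptions, raising the cap from 8 to 16 is what lets a second full-node executor start.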
- Apache Spark @Scale: A 60 TB+ production use case
- How Do In-Memory Data Grids Differ from Spark?
- The Data Skew Problem in Spark
- Skew Mitigation For Facebook PetabyteScale Joins - YouTube
- Analyzing Real Estate Overheating in 24 Hours as a First-Time Spark User
- Intro to Apache Spark for Java and Scala Developers - Ted Malaska (Cloudera)
- Achieving a 300% speedup in ETL with Apache Spark
- Introduces snippets for working with CSV files in Spark
- Compared with the non-distributed version, Spark offers a substantial speedup and can convert data into optimized formats such as Parquet
- Parsing CSV Files in Spark
- Diving into Spark and Parquet Workloads, by Example
- Parquet usage examples
- Making Proper Use of the Columnar Storage Format Parquet in Apache Spark
- [Armchair Dev] Reading Parquet Files with a Custom Schema in Spark | Charsyam's Blog
- Writing parquet on HDFS using Spark Streaming
- Experimenting with Neo4j and Apache Zeppelin (Neo4j)-[:LOVES]-(Zeppelin)
- Time-Series Missing Data Imputation In Apache Spark
- Tempo: Distributed Time Series Analysis with Apache Spark™ and Delta Lake - YouTube
- Data Science How-To: Using Apache Spark for Sports Analytics
- Hive on Spark: Getting Started
- Working with UDFs in Apache Spark
- Simple examples of using Apache Spark UDFs and UDAFs from Python, Java, and Scala
- How Apache Spark Became the Most Active Big Data Project
- Using Apache Spark for large-scale language model training
- Facebook is attempting to migrate its n-gram model training pipeline from Apache Hive to Apache Spark
- Describes the two solutions, compares the flexibility of the Spark DSL and Hive QL, and gives performance numbers
- Hive and Spark Integration Tutorial
- Working with multiple partition formats within a Hive table with Spark
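- Hive supports a different data format per partition, which can be used when converting data from a write-optimized format to a read-optimized format
- Explains the internals of how the execution plan behaves when Spark queries a multi-format table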
- On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies
- Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
- How to access Hive from Spark2 on HDP3?
- WRITING TO A DATABASE FROM SPARK
- Processing Solr data with Apache Spark SQL in IBM IOP 4.3
- Introduces how to connect Apache Spark to Apache Solr
- Blacklisting in Apache Spark
- Tracking the Money — Scaling Financial Reporting at Airbnb
- The Benefits of Migrating HPC Workloads To Apache Spark
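- Describes recent improvements to the integration between Apache Zeppelin and the Livy job server for running Spark jobs
- Building a Data Analysis Infrastructure (1/4)
- Building a Data Analysis Infrastructure (2/4)
- Building a Data Analysis Infrastructure (3/4)
- Building a Data Analysis Infrastructure (4/4)
- zipWithIndex, for-yield examples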
- Cloudera session seoul - Spark bootcamp
- Benchmarking Big Data SQL Platforms in the Cloud
- Claims that the Databricks platform is faster than vanilla Spark, Presto, and Impala
- Building QDS: AIR Infrastructure
- About AIR, the platform Qubole presented at the Data Platforms 2017 conference.
- Spark Study ParkS
- Cost Based Optimizer in Apache Spark 2.2
- Describes the Cost Based Optimizer in Apache Spark 2.2, compares query execution times on the TPC-DS benchmark with and without CBO, and explains how to collect statistics
- Apache Spark Core-Deep Dive-Proper Optimization - Daniel Tomes, Databricks
- A Taste of Java-Based Web Application Development with the Spark Framework
- Bay Area Apache Spark Meetup at HPE/Aruba Networks Summary
- Aruba's presentation on data correlation using PySpark and GraphFrames on Databricks
- Apache Spark Professional Training with Hands On Lab
- Getting Started with Data Analysis Using DSX Spark on IBM Cloud
- Spark-overflow - A collection of Spark related information, solutions, debugging tips and tricks, etc. PR are always welcome! Share what you know about Apache Spark
- Debugging a long-running Apache Spark application: A War Story
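- Explains how to debug performance problems in a long-running Apache Spark application
- Covers JVM internals (e.g., custom class loaders and GC), Spark internals (e.g., how the driver cleans up broadcast data), and the metrics and monitoring strategies used to find and confirm such bugs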
- A step-by-step guide for debugging memory leaks in Spark Applications | by Shivansh Srivastava | disney-streaming | Nov, 2020 | Medium
- A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
- Using Apache Spark to Analyze Large Neuroimaging Datasets
- Goal Based Data Production: The Spark of a Revolution - Sim Simeonov
- Spark Job On Mesos - Log Handling: programmatically writing logs to a chosen location per log level
- How to log in Apache Spark log4j
- Spark: Web Server Logs Analysis with Scala
- Top 5 Mistakes to Avoid When Writing Apache Spark Applications
- Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day
- Locality Sensitive Hashing By Spark
- Partition Index - Selective Queries On Really Big Tables: implements a balancer that sits between the client and the daemons, builds and maintains an index map, and parses queries so that querying data with Hive, Impala, Spark, etc. does not scan the whole table
- Practical Apache Spark in 10 minutes
- Problem, e.g.: performance drops because too few partitions are created, so too few workers are allocated as well
-
Occurs frequently in a variety of situations
-
The surest fix is an improved Spark SQL optimizer, but since we cannot wait for that, use repartition to force the number of partitions up
val dataset: Dataset[XXX] = ...
dataset.repartition(dataset.rdd.getNumPartitions * 2).map(YYY)...
-
- Apache Spark Scheduler
- Deep Dive into the Apache Spark Scheduler - Xingbo Jiang
- Apache Spark: Scala vs. Java v. Python vs. R vs. SQL
- Explains the differences among the Apache Spark APIs in Scala, Java, Python, R, and SQL
- As expected, using a JVM language yields better performance
- Exploratory Data Analysis in Spark with Jupyter
- Discussion of Apache Spark Issues + Lazy Initialization in Java (2018-07-06) KevinTV Live
- Working with Nested JSON Using Spark | Parsing Nested JSON File in Spark
- Working with JSON in Apache Spark
- practice - reading gzipped HDFS files with sc.textFile leads to degraded performance or job failure
- What’s new in Apache Spark 2.3 and Spark 2.4
- What’s new in Spark 2.4!
- Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
- Summarizes how Uber uses Hadoop and Spark to collect, manage, and analyze big data
- Spark study notes: core concepts visualized
- An introductory document explaining how Spark and YARN interact and how a job works at each stage
- Just Enough Spark! Core Concepts Revisited !! | LinkedIn
- Python vs. Scala
- Points to remember while processing streaming timeseries data in order using Kafka and Spark
- A Journey Into Big Data with Apache Spark
- Write to multiple outputs by key Spark - one Spark job
- Things I Wish I’d Known About Spark When I Started (One Year Later Edition)
- Brian Clapper—Spark for Scala Developers
- Movie recommendation using Apache Spark
- NPE from Spark App that extends scala.App
- [Armchair Dev] Precedence between --properties-file and command-line parameters in spark-submit
- Which Language to choose when working with Apache Spark
- Processing Data with Spark
- Which Career Should I Choose — Hadoop Admin or Spark Developer?
- Efficient geospatial analysis with Spark
- Dealing with null in Spark
- '.NET for Apache Spark' Debuts for C#/F# Big Data
- Announcing Version 1.0 of .NET for Apache Spark | .NET Blog
- Parallel Cross Validation in Spark
- Vedant Jain: Smart Streams: A Real-time framework for scoring Big and Fast Data | PyData Miami 2019
- Jakub Hava: Productionizing H2O Models with Apache Spark | PyData Miami 2019
- Big Data Processing Explained with Spark
- Multi Source Data Analysis using Spark and Tellius : Meetup Video
- Spark Performance Optimization and Tuning - Part 1
- Master Spark fundamentals & optimizations
- Apache Spark Optimization Techniques | by Nabarun Chakraborti | Jun, 2020 | Medium
- ClickHouse Clustering for Spark Developer
- Data Modeling in Apache Spark - Part 1 : Date Dimension
- Data Modeling in Apache Spark - Part 2 : Working With Multiple Dates
- Concurrency in Spark
- Big Data file formats explained
- Why Spark on Ceph? (Part 1 of 3)
- Why Spark on Ceph? (Part 2 of 3)
- Why Spark on Ceph? (Part 3 of 3)
- Spark Delight — We’re building a better Apache Spark UI | by Jean Yves | Jun, 2020 | Towards Data Science
- Overcoming Apache Spark’s biggest pain points | by Edson Hiroshi Aoki | Oct, 2020 | Towards Data Science
- Speeding Time to Insight with a Modern ETL Approach - YouTube ETL -> ELT
- Scale-Out Using Spark in Serverless Herd Mode! - YouTube
- DBIOTransactionalCommit - Databricks
- [Armchair Dev] Use sc.addFile on EMR, and just use a DBFS folder on Databricks | Charsyam's Blog
- Spark interview Q&As with coding examples in Scala - part 1 | Java-Success.com
- How to Extract Deeper Value from Data in Legacy Applications with Analytics in a Cloud Data Lake - YouTube
- How Kmong Built Its Data Lake | by jun yeong park | Jan, 2024 | Medium
- Scala 3 and Spark? | by Filip Zybała | VirtusLab | Oct, 2021 | Medium
- Using Scala 3 with Spark | 47 Degrees
- Apache Spark #1 - Architecture and Basic Concepts
- Practical Spark – Intro (1) – 1ambda
- Practical Spark – Tutorial (2) – 1ambda
- Practical Spark – Concept (3) – 1ambda
- Practical Spark – Architecture (4) – 1ambda
- Practical Spark – DataFrame (5) – 1ambda
- Practical Spark – Persistence (6) – 1ambda
- Practical Spark – Cache (7) – 1ambda
- Practical Spark – SQL & Table (8) – 1ambda
- Practical Spark – Join (9) – 1ambda
- Practical Spark – Memory (10) – 1ambda
- Practical Spark – Versions (11) – 1ambda
- Practical Spark – Frequently Asked Questions (12) – 1ambda
- 40+ Apache Spark best practices & optimisation interview FAQs - Part-2 Spark UI | Java-Success.com
- Salting: The Secret Ingredient to Optimize Your Apache Spark Workflows | by Sukumaar Mane | Jun, 2023 | Medium
- Why Scala Dominates Data Engineering | by Henri Happonen | Sep, 2023 | Medium
- The "Text data source supports only a single column, and you have 2 columns" error message in Spark | Popit
- Data Engineering in the Age of AI: Data Intelligence Platforms - YouTube
- Apache Livy A REST Service for Apache Spark
- How to View Spark Job stdout Logs in Apache Livy - Nephtyw’S Programming Stash
-
Spark Programming Model : Resilient Distributed Dataset (RDD) - 2015
-
backtobazics.com/category/big-data/spark example of API
-
Exploring Spark DataSource V2
-
aggregate
scala> val rdd = sc.parallelize(List(1, 2, 3, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21
scala> rdd.aggregate((0, 0))((x, y) => (x._1 + y, x._2 - y), (x, y) => (x._1 + y._1, x._2 + y._2))
res10: (Int, Int) = (9,-9)
scala> rdd.map(t => (t, -t)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
res11: (Int, Int) = (9,-9)
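
To make the two function arguments of `aggregate` easier to read, here is a minimal sketch that mimics the call above on plain Scala collections, with no Spark dependency. `AggregateSketch` and `aggregateLike` are hypothetical names, and the split of the data into two partitions is an assumption made for illustration. The first function folds the elements of each partition into an accumulator starting from the zero value, and the second merges the per-partition accumulators.

```scala
object AggregateSketch {
  // Hypothetical helper (not a Spark API): emulates the shape of RDD.aggregate
  // over a collection that has already been split into "partitions".
  def aggregateLike[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Step 1: fold every partition independently, starting from the zero value.
    val perPartition = partitions.map(p => p.foldLeft(zero)(seqOp))
    // Step 2: merge the per-partition results, again starting from the zero value.
    perPartition.foldLeft(zero)(combOp)
  }

  // Same data as List(1, 2, 3, 3), split into two assumed partitions.
  val example: (Int, Int) =
    aggregateLike(Seq(Seq(1, 2), Seq(3, 3)), (0, 0))(
      (acc, y) => (acc._1 + y, acc._2 - y), // seqOp: running sum and negated sum
      (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: add accumulators pairwise
    )

  def main(args: Array[String]): Unit =
    println(example) // prints (9,-9), matching res10 above
}
```

Because the zero value is fed into every partition and into the final merge, it should be a neutral element for both functions, as `(0, 0)` is here.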
-
aggregateByKey
-
Array Deep Dive into Apache Spark Array Functions | by Neeraj Bhadani | Expedia Group Technology | Medium
-
combineByKey
-
DataFrames
- Spark SQL, DataFrames and Datasets Guide
- Spark 2.0 DataFrame examples: filter, where, isin, select, contains, col, between, withColumn
- Spark: Connecting to a jdbc data-source using dataframes
- [Armchair Dev] How to Dump a Database Quickly in Spark (Parallelism) | Charsyam's Blog Spark JDBC
- The difference between where and filter
- Using spark data frame for sql
- Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)
- Spark: Elegantly Aggregate DataFrame by One Key Column
- A practical introduction to Spark’s Column- part 1
- A practical introduction to Spark’s Column- part 2
- Different approaches to manually create Spark DataFrames
- Sending Spark DataFrame via mail
- How I achieved 3x speedup for joins over Spark dataframes
- Deep dive into Apache Spark Window Functions | by Neeraj Bhadani | Expedia Group Technology | Medium
- Making the Spark DataFrame composition type safe(r) | by Iaroslav Zeigerman | Feb, 2021 | Medium
- How to add row numbers to a Spark DataFrame? | Data Programmers
-
Datasets
- Introducing Spark Datasets
- Spark SQL, DataFrames and Datasets Guide
- RDDs, DataFrames and Datasets in Apache Spark - NE Scala 2016
- Spark2.0 New Features
- Transforming Spark Datasets using Scala transformation functions
- Solution to Spark Auto Schema inference (String) for JSON Array / JSON Object/Record/Row Problem | LinkedIn
-
distinct
-
groupByKey
-
HashPartitioner
-
join
-
persist
-
SQL
- Spark SQL, DataFrames and Datasets Guide
- spark-csv - CSV Data Source for Apache Spark 1.x
- Spark SQL CSV Examples
- github.com/yhuai/spark/tree/eb77ee39b8616cb367541503baf7c07695ef1ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv
- Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration
- Spark 2.0 read csv number of partitions (PySpark)
- How to read csv file as DataFrame?
- How to change column types in Spark SQL's DataFrame?
- Working with Nested Data Using Higher Order Functions in SQL on Databricks
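- Hadoop and Spark are great tools for processing complex and varied data such as nested structs, arrays, and maps, but working with such data in SQL is difficult
- Introduces the TRANSFORM operation added in Databricks 3.0 and the "Higher Order Functions" added to Spark SQL (SPARK-19480)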
- Spark SQL under the hood – part I
- Five Spark SQL Utility Functions to Extract and Explore Complex Data Types - Tutorial on how to do ETL on data from Nest and IoT Devices
- Querying our Data Lake in S3 using Zeppelin and Spark SQL
- Learning Spark SQL with Zeppelin
- SQL Pivot: Converting Rows to Columns 2.4
- SQL at Scale with Apache Spark SQL and DataFrames — Concepts, Architecture and Examples
- A Deep Dive into Query Execution Engine of Spark SQL - Maryann Xue
- A Deep Dive into Spark SQL's Catalyst Optimizer - Yin Huai
-
trigger Spark Trigger Options
- TheBook: Techniques for Working with Spark, chapters 4-6 only
- Mastering Apache Spark 2.0
- Advanced Analytics with Spark Source Code
- Best Apache Spark and Scala Books for Mastering Spark Scala
- Spark for Data Analyst Spark SQL
- Scaling Machine Learning with Spark • Adi Polak & Holden Karau • GOTO 2023 - YouTube
- Spark Day 2017@Seoul(Spark Bootcamp)
- Spark Day 2017 - The Past, Present, and Future of Spark
- Korean Text Classification with Spark & Zeppelin
- Zeppelin notebook: NSMC Word2Vec & Sentiment Classification
- Large-Scale Security Data Analysis with Spark, Mesos, Zeppelin, and HDFS
- 2020 Data Conference "Optimizing a Recommender System Serving Pipeline with Spark+Cassandra-Based Big Data" / Park Su-seong, SSG.COM Partner - YouTube
- Tale of Scaling Zeus to Petabytes of Shuffle Data @Uber - YouTube
- Sub-Second Analytics for User-Facing Applications with Apache Spark and Rockset - YouTube
- Denny Lee & Ginger Holt - Use Spark from Anywhere | Scala Days 2023 Seattle - YouTube
- Databricks Connect v2 Quickstart - YouTube
- Databricks Marketplace - YouTube
- Apache Spark vs cloud-native SQL engines — Franz Wöllert - YouTube
- yahoo/CaffeOnSpark
- CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters
- Large Scale Distributed Deep Learning on Hadoop Clusters
- SparkNet: Training Deep Networks in Spark
- large scale deep-learning_on_spark
- DeepSpark: Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility
- The Unreasonable Effectiveness of Deep Learning on Spark
- GPU Acceleration in Databricks Speeding Up Deep Learning on Apache Spark
- Deep Learning on Databricks - Integrating with TensorFlow, Caffe, MXNet, and Theano
- Deep Learning With Apache Spark
- Deep Learning Pipelines for Apache Spark
- practice
- DIT4C image for Apache Zeppelin
- hub.docker.com/r/k3vin/polynote-spark
- spark-scala-tutorial A free tutorial for Apache Spark docker jupyter notebook
- Apache Spark on Docker
- Distributed Pricing Engine using Dockerized Spark on YARN w/ HDP 3.0
- Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker
- DIY: Apache Spark & Docker. Set up a Spark cluster in Docker from… | by Shane De Silva | Towards Data Science
- GraphX
- Spark Streaming and GraphX at Netflix - Apache Spark Meetup, May 19, 2015
- Spark User Group (Ssamo) Tech Talk - GraphX
- Computing Shortest Distances Incrementally with Spark
- Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
- Processing Hierarchical Data using Spark Graphx Pregel API
- Examples of and instructions for using the GraphX API
- Community detection in graph Girvan newman algorithm
- example
- A simple API to interact with HBase with Spark
- Apache Spark Comes to Apache HBase with HBase-Spark Module
- HBase Integration with Spark | How to Integrate HBase with Spark | Spark Integration with HBase
- How to create Spark Dataframe on HBase table
- Optimizing HBase WAL Writes Using the HDFS Write Pipeline
- Developing HBH (HitBase Handler) for the Migration to Open Source HBase
- Ignite - Spark Shared RDDs
- Installing Apache Spark 2.3.0 on macOS High Sierra
- How to install and run Spark 2.0 on HDP 2.5 Sandbox
- Apache Spark installation on Windows 10
- Spark Standalone: From Installation to Testing the Examples
- Installing Hadoop and Spark
- Setting Up a Spark (Scala) Development Environment (Windows)
- How to Install Scala and Apache Spark on MacOS
- Apache Spark setup with Gradle, Scala and IntelliJ
- Create Spark Scala SBT project in Intellij Idea. 1-minute tutorial - YouTube
- pocketcluster - One-Step Spark/Hadoop Installer v0.1.0
- Spark 2: How to install it on Windows in 5 steps
- Apache Spark Setup in Windows|Intellij IDE|CommandLine|Databricks|Zeppelin|All Methods Covered 2021. - YouTube
- Setting Up a Spark Development Environment with IntelliJ
- Introduction to Spark on Kubernetes
- What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release
- 2.4 preview: stronger Kubernetes support, added PySpark/SparkR support, and more
- Spark day 2017@Seoul - Spark on Kubernetes
- Scalable Spark Deployment using Kubernetes
- Docker Image and Kubernetes Configurations for Spark 2.x
- Part 1 : Introduction to Kubernetes
- Part 2 : Installing Kubernetes Locally using Minikube
- Part 3 : Kubernetes Abstractions
- Part 4 : Service Abstractions
- Part 5 : Building Spark 2.0 Docker Image
- Part 6 : Building Spark 2.0 Two Node Cluster
- Part 7 : Dynamic Scaling and Namespaces
- Auto Scaling Spark in Kubernetes
- The anatomy of Spark applications on Kubernetes
- Describes Spark's experimental support for Kubernetes and the upcoming support for in-cluster client mode
- Spark driver, Executor, Executor Shuffle Service, Resource Staging Server
- How to build Spark from source and deploy it to a Kubernetes cluster in 60 minutes
- Apache Spark workloads on Kubernetes
- Apache Spark Streaming in K8s with ArgoCD & Spark Operator - YouTube
- Spark on Kubernetes - Gang Scheduling with YuniKorn - Cloudera Blog
- Superworkflow of Graph Neural Networks with K8S and Fugue - YouTube word2vec node2vec
- So Long Hadoop - Running Spark On Kubernetes by Erik Schmiegelow - YouTube
- Hadoop Tutorial: the new beta Notebook app for Spark & SQL
- AWS Athena Data Source for Apache Spark
- batch-processing-gateway: The gateway component to make Spark on K8s much easier for Spark users
- BigDL: Distributed Deep learning on Apache Spark
- CLOUD DATAPROC - Google Cloud Dataproc is a managed Spark and Hadoop service that is fast, easy to use, and low cost
- Google Unveils a Managed Cloud Service for Spark and Hadoop
- Using Google Cloud Dataproc http://whitechoi.tistory.com/48
- couchbase-spark-connector - The Official Couchbase Spark Connector
- CueSheet - a framework for writing Apache Spark 2.x applications more conveniently
- Delta Lake - Reliable Data Lakes at Scale
- Delta Lake on Databricks - Databricks
- Tutorial: How Delta Lake Supercharges Data Lakes - YouTube
- SmartSQL Queries powered by Delta Engine on Lakehouse - YouTube
- Making Apache Spark™ Better with Delta Lake - YouTube
- Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks - YouTube
- Delta Lakehouse Data Profiler and SQL Analytics Demo - YouTube
- Data Mesh and Lakehouse - Matei Zaharia, Databricks - YouTube
- Optimising Geospatial Queries with Dynamic File Pruning - YouTube
- Demystifying Delta Lake. Data Brew | Episode 3 - YouTube
- Delta Lake on Databricks Demo - YouTube
- Make Reliable ETL Easy on Delta Lake - YouTube
- Building Lakehouses on Delta Lake with SQL Analytics Primer - YouTube
- Massive Data Processing in Adobe Experience Platform Using DeltaLake | by Jaemi Bremner | Adobe Tech Blog | Medium
- Multi-Table Transactions with LakeFS and Delta Lake - YouTube
- Databricks Korea Lakehouse Day 0420.mp4 on Vimeo
- Architecting for Data Quality in the Lakehouse with Delta Lake and PySpark - YouTube
- Massive Data Processing in Adobe Using Delta Lake: One Year In - YouTube
- Simplify ETL pipelines on the Databricks Lakehouse - YouTube
- The Data Lakehouse for Streaming Data - A talk for everyone who ❤️ data by Frank Munz - YouTube
- AI-Accelerated Delta Tables: Faster, Easier, Cheaper - YouTube
- Dr. Elephant Self-Serve Performance Tuning for Hadoop and Spark
- EMR
- Large-Scale Machine Learning with Spark on Amazon EMR
- Amazon EMR Begins Supporting Apache Spark
- Spark on EMR
- (BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
- Starburst’s Presto on AWS up to 18x faster than EMR: benchmark results on AWS and EMR from Starburst, which provides an enterprise build of Presto
- Optimize Spark jobs on EMR Cluster
- Envelope - a configuration-driven framework for Apache Spark that makes it easy to develop Spark-based data processing pipelines on a Cloudera EDH
- How to implement pipelines on the Cloudera enterprise data hub (EDH) using Apache Spark, Apache Kudu, and Apache Impala together with Envelope
- Configuration specification
- Bi-temporal data modeling with Envelope
- Cloudera Enterprise Data Hub - Our flagship can now be yours
- flambo - A Clojure DSL for Apache Spark
- GraphFrames: DataFrame-based Graphs
- Hail: Scalable Genomics Analysis with Apache Spark
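- An overview of Hail, a tool for performing genomic analysis with Apache Spark
- Offers a simple yet powerful programming model, demonstrated by running an example that computes sample quality and performs a simple genome-wide association study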
- Hudi - Spark Library for Hadoop Upserts And Incrementals https://uber.github.io/hudi
- The Evolution of Uber’s 100+ Petabyte Big Data Platform
- Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop
- Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads
- Building an analytical data lake with Apache Spark and Apache Hudi - Part 1
- Hydrogen
- Infinispan Spark connector 0.1 released!
- IMLLIB - Factorization Machines (LibFM), Field-Aware Factorization Machine (FFM), Conditional Random Fields (CRF), Adaptive learning rate optimizer (AdaGrad, Adam)
- Lighthouse - a library for data lakes built on top of Apache Spark
- Livy, the Open Source REST Service for Apache Spark, Joins Cloudera Labs
- MapR-DB Spark Connector with Secondary Indexes
- native_spark - new arguably faster implementation of Apache Spark from scratch in Rust
- snappydata - Unified Online Transactions + Analytics + Probabilistic Data Platform
- spark-annoy: Building Annoy Index on Apache Spark
- spark cassandra connector - a library for integrating Cassandra with Spark
- spark-fatJAR-example: scala-spark build fat-jar example
- spark-indexed - An efficient updatable key-value store for Apache Spark
- Sparkline SNAP
- spark-nkp Natural Korean Processor for Apache Spark
- Spark Notebook
- SparMysqlSample
- spark-nlp - Natural Language Understanding Library for Apache Spark
- Spark NLP: Getting Started With The World’s Most Widely Used NLP Library In The Enterprise
- Spark NLP 101: Document Assembler
- Spark NLP: Installation on Mac and Linux (Part-II)
- Introduction to Spark NLP: Foundations and Basic Components
- Spark NLP 101: LightPipeline
- Spark in Docker in Kubernetes: A Practical Approach for Scalable NLP | by Jürgen Schmidl | Towards Data Science
- spark-packages - A community index of packages for Apache Spark
- spark-ts - Time Series for Spark (The spark-ts Package)
- spark-xml - XML data source for Spark SQL and DataFrames
- StreamSets Transformer - an execution engine within the StreamSets DataOps platform that allows any user to create data processing pipelines that execute on Spark
- zio
- Deep Dive into Monitoring Spark Applications Using Web UI and SparkListeners (Jacek Laskowski)
- Apache Spark performance - All relevant key performance metrics about your Apache Spark instance in minutes
- HTRACE TUTORIAL: HOW TO MONITOR YOUR DISTRIBUTED SYSTEMS
- delight: A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source
- spark-dependencies - Spark job for dependency links http://jaegertracing.io
- spark-jobs-rest-client - Fluent client for interacting with Spark Standalone Mode's Rest API for submitting, killing and monitoring the state of jobs
- Sparklint - The missing Spark Performance Debugger that can be drag and dropped into your spark application!
- sparkoscope - Enabling Spark Optimization through Cross-stack Monitoring and Visualization
- zipkin-dependencies - Spark job that aggregates zipkin spans for use in the UI
- BerkeleyX: CS190.1x Scalable Machine Learning
- Feature Engineering at Scale With Spark
- Audience Modeling With Spark ML Pipelines
- Spark + AI Summit 2018 — Overview
- Using Native Math Libraries to Accelerate Spark Machine Learning Applications
- How to speed up model training using native libraries for Spark ML
- Why native libraries benefit Spark ML
- How to enable native libraries with CDH Spark
- A comparison of Spark ML performance when using other native libraries
- Machine Learning with Jupyter using Scala, Spark and Python: The Setup
- Spark Day 2017 Machine Learning & Deep Learning With Spark
- Building a Big Data Machine Learning Spark Application for Flight Delay Prediction
- Apache Spark 2.0 Preview: Machine Learning Model Persistence by Databricks
- Ranking Algorithms for Spark Machine Learning Pipeline BM 25 + Wilson score on spark 2.2.0
- An Introduction to Machine Learning with Apache Spark™
- Multiple Column Feature Transformations in Spark ML
- End to End Spark TensorFlow PyTorch Pipelines with Databricks Delta - Jim Dowling, Logical Clocks AB, Kim
- Accelerating Deep Learning on the JVM with Apache Spark and NVIDIA GPUs
- Spark ML hyperparameter tuning
- Scaling and Unifying SciKit Learn and Apache Spark Pipelines - YouTube
- Sawtooth Windows for Feature Aggregations - YouTube
- Run Your Queries Instantly in One of the Most Optimized Environments - YouTube Nephos
- KeystoneML - Machine Learning Pipeline
- Meson: Netflix's framework for executing machine learning workflows
- MLflow
- MLLib
- Decision Trees
- MLlib: Machine Learning in Apache Spark
- movie recommendation with mllib
- WSO2 Machine Learner: Why would You care?
- Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
- Spark ML Lab
- Machine Learning with Spark
- Introduction to Machine Learning with Apache Spark
- Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
- Introduction to Machine Learning on Apache Spark MLlib
- Introduction to Machine learning with Spark
- Introduction to ML with Apache Spark MLib by Taras Matyashovskyy
- pipelineio - End-to-End Spark ML and Tensorflow AI Data Pipelines
- Extend Spark ML for your own model/transformer types
- Accelerating Apache Spark MLlib with Intel® Math Kernel Library (Intel® MKL)
- Improving BLAS library performance for MLlib
- Extend Spark ML for your own model/transformer types
- Machine Learning with Apache Spark
- Building A Linear Regression with PySpark and MLlib
- How to run Linear Regression in Python using PySpark - YouTube
- Building Custom ML PipelineStages for Feature Selection - Marc Kaminski
- Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem
- Dataset deduplication using spark’s MLlib
- Deep Learning with Apache Spark and TensorFlow
- TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters - Andy Feng & Lee Yang
- github.com/yahoo/TensorFlowOnSpark
- Deep learning for Apache Spark
- Spark machine learning & deep learning
- Spark Deep Learning Pipelines
- Deep Learning With Apache Spark
- Converting Spark ML Vector to Numpy Array
- PyData Tel Aviv Meetup: Learning Large Scale Models for Content Recommendation - Sonya Liberman
- MMLSpark - Microsoft Machine Learning for Apache Spark
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning http://oryx.io
- Production Recommendation Systems with Cloudera - an example of using the Cloudera Oryx project to build infrastructure and data pipelines for machine learning features
- A recommendation system using Kafka + Spark + Cloudera Hadoop
- raydp: RayDP: Distributed data processing library that provides simple APIs for running Spark on Ray and integrating Spark with distributed deep learning and machine learning frameworks
- spark-vlbfgs - an implementation of the Vector-free L-BFGS solver and some scalable machine learning algorithms for Apache Spark
- TransmogrifAI Chetan Khatri - TransmogrifAI - Automate ML Workflow with power of Scala and Spark at massive scale
- PySpark
- PySpark & Hadoop: 1) Installing on Ubuntu 16.04
- PySpark & Hadoop: 2) Launching an EMR cluster and submitting jobs with PySpark
- PySpark Cheat Sheet: Spark in Python
- Big Data Analytics using Python and Apache Spark | Machine Learning Tutorial
- troubleshooting
- A Beginner's Guide on Troubleshooting Spark Applications
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.package
- check the sbt configuration, such as resolvers
java.lang.OutOfMemoryError: GC overhead limit exceeded
- increase driver memory
org.apache.spark.SparkException: Could not find BlockManagerEndpoint1 or it has been stopped
- searching turns up nothing in particular
spark java.io.IOException: Filesystem closed
- usually the result RDD is too big
Task not serializable
- Spark - Task not serializable: How to work with complex map closures that call outside classes/objects?
- Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
- java+spark: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException
TypeError: 'bool' object is not callable
- Use PYSPARK_PYTHON=...
yarn.scheduler.maximum.allocation-mb
- increase this value in yarn-site.xml
- free up disk space (insufficient free space can cause this too)
- Cannot submit Spark app to cluster, stuck on “UNDEFINED”
yarn.nodemanager.resource.memory-mb
- confirmed working after adjusting this value
contains a task of very large size warning
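- Problem: rows read into a DataFrame had to be text-processed and then compared row against row, which triggered the "contains a task of very large size" warning
- Solution: store the text-processed intermediate results in Redis, then process them row by row in a separate Spark application
- Cause
- Spark manages the work each Executor must perform in units called Tasks
- Operations applied to an RDD are grouped by their dependencies (Logical Planning), then optimization rules are applied to produce the Tasks that Executors actually run (Physical Planning)
- These are placed in an internal queue and sent to Executors in order
- More concretely, the Driver process builds a TaskDescription object from the work routine and the location of the target data, serializes it, and sends it over the network to the Worker process
- The catch is that when a Task exceeds 100 KB, the "contains a task of very large size" warning is raised
- This limit is hard-coded in the source and cannot be changed
- Using broadcast makes the situation worse
- Unlike sending a task, broadcast works by sending the data values themselves to the Workers one by one
- Here there are far more than one or two rows to send, so performance naturally suffers
- For these reasons, the only workable approach is to store the NLP-processed intermediate results in separate storage and read them back from a separate application
- Redis is recommended over other storage options because it is fast, easy to manage as a key-value store, and distributes reads well thanks to sharding
- Recently, Spark ML development has been moving toward storing trained models in Redis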
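
The size limit described above can be illustrated without a cluster. The sketch below is only an approximation: it uses the standard-library `pickle` as a stand-in for Spark's own task serialization, and `estimate_payload_kb`, `would_warn`, `WARN_THRESHOLD_KB`, and the key `job42:rows` are hypothetical names chosen for illustration (the 100 KB figure is the one cited in these notes). It shows that shipping the processed rows with a task pushes the payload past the threshold, while shipping only a key that points at externally stored rows does not.

```python
import pickle

# Threshold cited in the notes above; the constant name is hypothetical.
WARN_THRESHOLD_KB = 100


def estimate_payload_kb(obj) -> float:
    """Return the pickled size of obj in KB (a rough stand-in for task size)."""
    return len(pickle.dumps(obj)) / 1024


def would_warn(obj, threshold_kb: float = WARN_THRESHOLD_KB) -> bool:
    """True if the pickled object is larger than the warning threshold."""
    return estimate_payload_kb(obj) > threshold_kb


# Shipping the processed rows themselves with the task: large payload.
rows = ["processed text of row %d" % i for i in range(20_000)]

# Shipping only a key that points at rows stored elsewhere: small payload.
row_key = "job42:rows"

if __name__ == "__main__":
    print(f"rows payload: {estimate_payload_kb(rows):.1f} KB, warn={would_warn(rows)}")
    print(f"key payload:  {estimate_payload_kb(row_key):.3f} KB, warn={would_warn(row_key)}")
```

This is the reasoning behind moving intermediate results to an external store: the task then carries a reference rather than the data.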
- Resolving Spark Interpreter issues
- Getting started with PySpark - Part 1
- Getting started with PySpark - Part 2
- PySpark Internals
- Fast Data Analytics with Spark and Python
- pyspark-hbase.py
- Deploying PySpark on Red Hat Storage GlusterFS
- practice - weird case from pyspark-hbase (utf8 & unicode mixed)
- Python Versus R in Apache Spark
- biospark
- Plagiarizing and Paraphrasing Code From an Online Class for Content Marketing
- How-to: Use IPython Notebook with Apache Spark
- Configuring IPython Notebook Support for PySpark
- pyADAM - This is a wrapper to load Parquet data in PySpark
- PySpark: ignoring corrupted parquet files
- Accessing PySpark in PyCharm
- pyspark-project-example - A simple example for PySpark based project
- Recommendation Systems for Implicit Feedback
- Hassle Free ETL with PySpark
- 안명호 : Python + Spark, a perfect marriage for machine learning - PyCon APAC 2016
- Fully Arm Your Spark with Ipython and Jupyter in Python 3
- PySpark Cheat Sheet: Spark in Python
- Apache Spark for Data Science
- BigDL on CDH and Cloudera Data Science Workbench - how to use BigDL (a deep learning library for Apache Spark) with the Workbench
- Distributed Deep Learning At Scale On Apache Spark With BigDL
- Deep Learning to Big Data Analytics on Apache Spark Using BigDL - Yuhao Yang & Xianyan Jia
- Deep Learning on Qubole Using BigDL for Apache Spark – Part 2
- A short tutorial showing how to train and evaluate a model with the BigDL deep learning library
- Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench - how to write PySpark jobs that use Python libraries
- Install Spark on Windows (PySpark)
- Installing pyspark locally
- Get Started with PySpark and Jupyter Notebook in 3 Minutes
- Best Practices Writing Production-Grade PySpark Jobs
- How to use PySpark on your computer
- Spark Python Performance Tuning
- Getting The Best Performance With PySpark
- Improving Python and Spark Performance and Interoperability: Spark Summit East talk by Wes McKinney
- High Performance Python On Spark
- Comparing Performance between Apache Spark and PySpark
- Keynote: Making the Big Data ecosystem work together with Python - Holden Karau
- Downloading spark and getting started with python notebooks (jupyter) locally on a single computer
- A Brief Introduction to PySpark - A primer on PySpark for data science
- Introducing Pandas UDF for PySpark
- Reading CSV & JSON files in Spark – Word Count Example
- How to Upload/Download Files to/from Notebook in my Local machine
- Analyze MongoDB Logs Using PySpark
- Real-world Python workloads on Spark: EMR clusters
- First Steps With PySpark and Big Data Processing
- New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0™
- An Introduction to Pandas UDFs in PySpark | by Suffyan Asad | Medium
- How to create a simple ETL Job locally with PySpark, PostgreSQL and Docker
- Data Collab Lab: Automate Data Pipelines with PySpark SQL - YouTube
- Data Quality: Especially important with the medallion architecture with PySpark data testing - YouTube
- How to Manage Python Dependencies in Spark - The Databricks Blog
- Building a data analysis library (1)
- Building a data analysis library (2) - how to get integration testing and documentation at the same time
- 04b: Databricks – Spark SCD Type 2 with Merge | Java-Success.com
- Data analysis tricks using web log history data : Naver Blog
- Pandas API on Apache Spark - Part 1: Introduction
- Pandas API on Apache Spark - Part 2: Hello World
- Simplifying Testing of Spark Applications - Megan Yow | PyData Global 2021 - YouTube
- Pyspark Functions - YouTube
- How to Convert Pandas DataFrame to Spark DataFrame | Using PySpark - YouTube
- Xinrong Meng & Takuya Ueshin - Scale Data Science by Pandas API on Spark | PyData Global 2022 - YouTube
- Why Delta Lake is the Best Storage Format for Pandas Analyses - YouTube
- How PySpark Self-Join Simplifies Data Flattening - YouTube
- How to do Topic Modelling in Python using PySpark LDA - YouTube
- Use Spark from anywhere: A Spark client in Python powered by Spark Connect - YouTube
- PyCon KR 2023 Scaling data workloads with pandas and PySpark - 권혁진 - YouTube
- Pyspark - How to preprocess Large Scale Data with Python
- Koalas: pandas API on Apache Spark
- Koalas: Easy Transition from pandas to Apache Spark
- 10 Minutes from pandas to Koalas on Apache Spark With demonstrable Python how-to Koalas code snippets and Koalas best practices
- New Developments in the Open Source Ecosystem: Apache Spark 3 0, Delta Lake, and Koalas
- Analyzing big data faster on large clusters with pandas code - Koalas - 박현우 - PyCon Korea 2020 - YouTube
- The Jungle of Koalas, Pandas, Optimus and Spark | by Favio Vázquez | Towards Data Science
- Project Zen: Making Spark Pythonic | Reynold Xin | Keynote Data + AI Summit EU 2020 - YouTube
- Petastorm - a library enabling the use of Parquet storage from Tensorflow, Pytorch, and other Python-based ML training frameworks
- pyspark-ai: English SDK for Apache Spark
- quokka: Open source SQL engine in Python
- Snowflake
- Read from Kafka & Write to Snowflake via Spark Databricks | LinkedIn
- 입 개발 Spark SQL Query to Snowflake Query | Charsyam's Blog
- Integrating Snowflake with Spark on EMR. Snowflake Connector for Spark and JDBC… | by MJ Lee | Snowflake Korea | Aug, 2022 | Medium
- Snowflake Cleanroom : how to use each other's data with another company without exposing the data to each other | Snowflake Korea
- Using Snowflake with ChatGPT. ChatGPT, released in November '22, offers very impressive features and… | by juyun hwang | Snowflake Korea | Feb, 2023 | Medium
- Allan Campopiano - Machine Learning in the Warehouse with Python | PyData Global 2022 - YouTube
- Learn About Snowpark For Python In 2 Minutes: Scale Your Machine Learning Workflows - YouTube
- Python innovation is now available to all Snowflake customers
- Databricks vs Snowflake: Which platform is best for you? - Beyond the Horizon...
- How to use Snowflake from Python (Python Connector, Snowpark) · 어쩐지 오늘은
- Spark 1.4 for RStudio
- Python Versus R in Apache Spark
- SparkR installation and usage notes 1 - Installation Guide On Yarn Cluster & Mesos Cluster & Stand Alone Cluster
- MS R (formerly Revolution R) on Spark - installation and a look at its potential (feat. SparkR)
- sparklyr — R interface for Apache Spark
- sparklyr
- xwMOOC Machine Learning - sparklyr, which puts dplyr on top of Spark
- sparklyr – An R interface for Apache Spark
- spark + R
- Spark 2 Programming for Big Data Analysis: From Large-Scale Data Processing to Machine Learning
- On-Demand Webinar and FAQ: Parallelize R Code Using Apache Spark
- Vectorized R Execution in Apache Spark - Hyukjin Kwon (Databricks)
- How to Improve R Performance in SparkR at Apache Spark 3.0
- Data Source V2 API in Spark 3.0 - Part 1 : Motivation for New Abstractions
- Data Source V2 API in Spark 3.0 - Part 2 : Anatomy of V2 Read API
- Data Source V2 API in Spark 3.0 - Part 3 : In-Memory Data Source
- Data Source V2 API in Spark 3.0 - Part 4 : In-Memory Data Source with Partitioning
- Data Source V2 API in Spark 3.0 - Part 5 : Anatomy of V2 Write API
- Data Source V2 API in Spark 3.0 - Part 6 : MySQL Source
- Introduction to Spark 3.0 - Part 1 : Multi Character Delimiter in CSV Source
- Introduction to Spark 3.0 - Part 2 : Multiple Column Feature Transformations in Spark ML
- Introduction to Spark 3.0 - Part 3 : Data Loading From Nested Folders
- Introduction to Spark 3.0 - Part 4 : Handling Class Imbalance Using Weights
- Introduction to Spark 3.0 - Part 5 : Easier Debugging of Cached Data Frames
- Introduction to Spark 3.0 - Part 6 : Min and Max By Functions
- Introduction to Spark 3.0 - Part 7 : Dynamic Allocation Without External Shuffle Service
- Introduction to Spark 3.0 - Part 8 : DataFrame Tail Function
- Introduction to Spark 3.0 - Part 9 : Join Hints in Spark SQL
- Introduction to Spark 3.0 - Part 10 : Ignoring Data Locality in Spark
- Spark Plugin Framework in 3.0 - Part 1: Introduction
- Spark Plugin Framework in 3.0 - Part 2 : Anatomy of the API
- Spark Plugin Framework in 3.0 - Part 3 : Dynamic Stream Configuration using Driver Plugin
- Spark Plugin Framework in 3.0 - Part 4 : Custom Metrics
- Spark Plugin Framework in 3.0 - Part 5 : RPC Communication
- Adaptive Query Execution in Spark 3.0 - Part 1 : Introduction
- Adaptive Query Execution in Spark 3.0 - Part 2 : Optimising Shuffle Partitions
- AQE: Coalescing Post Shuffle Partitions – tech.kakao.com
- Distributed TensorFlow on Apache Spark 3.0
- Barrier Execution Mode in Spark 3.0 - Part 1 : Introduction
- Barrier Execution Mode in Spark 3.0 - Part 2 : Barrier RDD
- Webinar: A preview of Apache Spark 3.0
- Spark & AI summit and a glimpse of Spark 3.0 - Towards Data Science
- Introduction to and explanation of the new features in Spark 3.0 - Nephtyw’S Programming Stash
- NVIDIA Accelerates Spark Data Analytics Platform | NVIDIA Blog
- Spark 3.0 — New Functions in a Nutshell - Javarevisited - Medium
- Spark & AI summit and a glimpse of Spark 3.0 | by Adi Polak | Towards Data Science
- Apache Spark 3.0 changes
- Apache Spark 3.0 Exciting Capabilities | by Teepika R M | Jun, 2022 | Medium
- Spark SQL, DataFrames and Datasets Guide
- Deep Dive into Spark SQL’s Catalyst Optimizer
- Why DataFrames, unlike RDDs, can have optimizations applied
- SparkSQL cacheTable method performance comparison - default vs cacheTable vs cacheTable (with columnar Compression)
- SparkSQL Internals
- Spark Data Source API. Extending Our Spark SQL Query Engine
- Five Spark SQL Utility Functions to Extract and Explore Complex Data Types
- A tutorial on using built-in Spark SQL functions to handle JSON and nested structures
- Spark SQL: Another 16x Faster After Tungsten
- Windowing Functions in Spark SQL Part 1 | Lead and Lag Functions | Windowing Functions Tutorial
- Windowing Functions in Spark SQL Part 2 | First Value & Last Value Functions | Window Functions
- Windowing Functions in Spark SQL Part 3 | Aggregation Functions | Windowing Functions Tutorial
- Windowing Functions in Spark SQL Part 4 | Row_Number, Rank and Dense_Rank in SQL
- Simplifying Change Data Capture with Databricks Delta
- How saveAsTable behaves in Spark DataFrameWriter
- Dynamic Shuffle Partitions in Spark SQL
- Eliminating Shuffles in Delete Update, and Merge - YouTube
- Tech Chat: Faster Spark SQL: Adaptive Query Execution in Databricks - YouTube
- Sentiment Analysis on Demonetization in India using Apache Spark - Projects Based Learning
- Solving problems that arise when migrating HiveQL to Spark SQL
- FLARE: SCALE UP SPARK SQL WITH NATIVE COMPILATION AND SET YOUR DATA ON FIRE!
- Experimental stage
- Greatly improves Spark SQL performance by converting query plans to native code and also modifying the Spark runtime system
- Flare: Native Compilation for Heterogeneous Workloads in Apache Spark
- MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL
- Improved Fault-tolerance and Zero Data Loss in Spark Streaming
- Four Things to know about Reliable Spark Streaming
- Real Time Data Processing using Spark Streaming | Data Day Texas 2015
- Real-Time Analytics with Spark Streaming
- Can Spark Streaming survive Chaos Monkey?
- RecoPick real-time data processing system migration story (moving from Storm to Spark Streaming)
- From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 2
- Building a lossless stream processing infrastructure with Spark Streaming
- Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
- Handling empty batches in Spark streaming
- Spark Streaming Example (learning Spark Streaming by example)
- Long-running Spark Streaming Jobs on YARN Cluster
- Running long-lived streaming analysis jobs with spark-submit
- Spark Streaming operations and retrospective
- Deep Learning and Streaming in Apache Spark 2 x - Matei Zaharia & Sue Ann Hong
- 24/7 Spark Streaming on YARN in Production
- Running multiple Spark Streaming jobs of different DStreams in parallel
- Arbitrary Stateful Processing in Apache Spark’s Structured Streaming
- On the topic of 'exactly once', explains how to implement deduplication with Apache Spark's Structured Streaming
- Besides watermark-based deduplication, briefly explains how mapGroupsWithState can be used to add custom logic to stateful aggregations
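
The watermark-based deduplication mentioned above can be modeled in a few lines of plain Python. This is a toy simulation, not Spark's implementation or API: `WatermarkDeduplicator` and its fields are hypothetical names, events are assumed to be `(event_id, event_time_seconds)` pairs, and the watermark is assumed to advance only between batches. In Spark itself, the corresponding calls are `withWatermark` followed by `dropDuplicates`.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (event_id, event_time_seconds)


@dataclass
class WatermarkDeduplicator:
    """Toy model of watermark-bounded deduplication (not Spark's implementation)."""

    delay_seconds: int
    max_event_time: int = 0
    seen: Dict[str, int] = field(default_factory=dict)  # event_id -> event_time

    @property
    def watermark(self) -> int:
        # Events older than this are considered too late to process.
        return self.max_event_time - self.delay_seconds

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        """Return events that are neither duplicates nor older than the watermark."""
        events = list(batch)
        watermark = self.watermark  # computed from earlier batches only
        emitted: List[Event] = []
        for event_id, event_time in events:
            if event_time < watermark:
                continue  # late event: dropped
            if event_id in self.seen:
                continue  # duplicate: dropped
            self.seen[event_id] = event_time
            emitted.append((event_id, event_time))
        # Advance the watermark, then evict state that can no longer match.
        if events:
            self.max_event_time = max(self.max_event_time, max(t for _, t in events))
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        return emitted


if __name__ == "__main__":
    dedup = WatermarkDeduplicator(delay_seconds=10)
    print(dedup.process_batch([("a", 100), ("b", 101), ("a", 100)]))
    print(dedup.process_batch([("a", 100), ("c", 120)]))
    print(dedup.process_batch([("a", 100)]))
```

The point of the watermark is that the `seen` state stays bounded: once an ID falls behind the watermark it is evicted, and any later copy of it is rejected as late rather than as a duplicate.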
- Internals of Spark Streaming
- Why is My Stream Processing Job Slow?
- How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming - introduces the A/B testing platform built at Walmart Labs to implement a Lambda architecture with Apache Kafka and Spark Streaming/Batch
- Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
- Ingesting Raw Data with Kafka-connect and Spark Datasets
- Introduction to Spark Structured Streaming - Part 15: Meetup Talk on Time and Window API
- (Translated article) Internals of Spark Streaming
- Comparing Apache Spark, Storm, Flink and Samza stream processing engines - Part 1 - a comparative analysis of Apache Spark, Storm, Flink, and Samza
- Kafka Streams vs. Spark Structured Streaming
- Kafka Streams vs. Spark Structured Streaming (extended)
- Kafka offset committer for Spark structured streaming
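- Structured Streaming is often used to pull data from Kafka
- Spark assigns the Kafka consumer group ID arbitrarily and does not commit offsets, so there is little alternative to implementing a separate streaming query listener to track progress
- If you specify a group ID to commit under, the committed offsets of each batch are committed to Kafka, which helps track lag and similar metrics when combined with existing Kafka tools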
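
As a rough illustration of the lag tracking described above, the sketch below computes per-topic lag from two offset snapshots. It is not part of the linked project: `parse_offsets` and `compute_lag` are hypothetical helpers, the offsets are assumed to arrive as JSON shaped topic → partition → offset, and a partition with no committed offset is assumed to start at 0.

```python
import json
from typing import Dict

Offsets = Dict[str, Dict[str, int]]  # topic -> partition -> offset


def parse_offsets(offset_json: str) -> Offsets:
    """Parse an offset snapshot given as a JSON string."""
    return json.loads(offset_json)


def compute_lag(committed: Offsets, latest: Offsets) -> Dict[str, int]:
    """Per-topic lag: sum over partitions of (latest offset - committed offset)."""
    lag: Dict[str, int] = {}
    for topic, partitions in latest.items():
        done = committed.get(topic, {})
        lag[topic] = sum(
            # Assumption: a missing committed offset counts as 0; never go negative.
            max(end - done.get(partition, 0), 0)
            for partition, end in partitions.items()
        )
    return lag


if __name__ == "__main__":
    committed = parse_offsets('{"events": {"0": 120, "1": 95}}')
    latest = parse_offsets('{"events": {"0": 150, "1": 100}}')
    print(compute_lag(committed, latest))
```

Once offsets are committed to Kafka under a known group ID, existing Kafka tooling can report the same figure without custom code.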
- Scaling Spark Streaming for Logging Event Ingestion
- State Storage in Spark Structured Streaming
- State Management in Spark Structured Streaming
- Watermarking in Spark Structured Streaming
- Structured streaming in a flash
- File sink and Out-Of-Memory risk on waitingforcode.com - articles about Apache Spark Structured Streaming
- 입 개발 How do Kafka and Spark Structured Streaming behave when the checkpoint contains a very old offset? | Charsyam's Blog
- 입 개발 How are offsets managed in Spark Structured Streaming (a very brief version)? | Charsyam's Blog
- 입 개발 A very brief summary of BackPressure in Spark Kafka Streaming. | Charsyam's Blog
- Structured Streaming Use-Cases at Apple - YouTube
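- Apache Spark - The Spark Structured Streaming Kafka Sink does not support Exactly-Once | leeyh0216's devlog
- Real-time ad user ID mapping
- The Naver ad system needs to generate IDs that can represent the users an ad is shown to
- The real-time ad user ID mapping system maps user IDs extracted from large volumes of event logs to group IDs that can represent ad users
- This article covers the real-time ad user ID mapping system from its design through each of its main modules
- It includes how a microservice architecture was built with gRPC and Spark Structured Streaming, and how user IDs are mapped into a graph structure
- Deployment and monitoring strategies for Spark Streaming in an AI-based ad recommendation pipeline / if(kakaoAI)2024 - YouTube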
- Unit Testing Apache Spark Applications using Hive Tables
- How I test with Apache Spark?
- Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
- Running Spark on YARN
- Apache Spark Resource Management and YARN App Models
- Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
- Spark Yarn Cluster vs Spark Mesos Cluster (vs various other modes) - a comparison of performance and usability
- Dynamic Resource Allocation Spark on YARN
- Investigation of Dynamic Allocation in Spark
- Spark Cluster Settings On Yarn : Spark 1.4.1 + Hadoop 2.7.1
- Spark logging configuration in YARN
- Understanding Apache Spark on YARN
- Spark on YARN: a Deep Dive - Sandy Ryza, Cloudera
- Apache Spark Performance Benchmarks show Kubernetes has caught up with YARN - Data Mechanics Blog
- Zeppelin
- Apache Zeppelin Release 0.7.0
- www.zepl.com previously www.zeppelinhub.com
- practice
- Introduction to Zeppelin
- Zeppelin overview
- Installing Zeppelin
- Simple Zeppelin installation with Docker
- 5. Installing Zeppelin, a web-based command interpreter
- How-to: Install Apache Zeppelin on CDH
- Angular display system dashboard on Zeppelin
- Analyzing data with Apache Zeppelin by VCNC
- Zeppelin Context
- Integrating Apache Tajo desktop + Zeppelin
- EMR with Zeppelin on board, April 2016
- Zeppelin at Twitter
- Apache Zeppelin: from Korea to the world
- Zeppelin Lab
- A case study of building a super-simple BI with Presto and Zeppelin
- A case study of building a super-simple BI system with Presto and Zeppelin (1)
- Serving Shiro enabled Apache Zeppelin with Apache mod_proxy + SSL (https)
- Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin
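- Zeppelin use case at the University of Seoul Data Mining Lab
- Analyzing social media reactions to the Note 7. #3 Detailed analysis using Zeppelin notebooks
- September Valentine webinar - 민경국's Introduction to Apache Zeppelin online hands-on lecture
- Open source diary 2: What is Apache Zeppelin?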
- How Apache Zeppelin runs a paragraph
- Applying machine learning in practice with Spark & Zeppelin
- Plotting statistical graphs with Spark and Zeppelin (Windows environment): a story of failure
- Apache Zeppelin Data Science Environment 1/21/16
- Zeppelin Build and Tutorial Notebook
- zdairi is zeppelin CLI tool
- Implementing automatic login when sharing a Zeppelin Paragraph
- Building a dashboard with Apache Zeppelin in 25 minutes - 박훈(@1ambda)
- Using Amazon Athena with Apache Zeppelin
- ZEPL - How to Configure a JDBC Interpreter
- www.zepl.com/resources how-to videos
- Spark Scala Note 1
- Journey to the Continuous and Scalable Big Data Platform
- Introducing Big Data Tools – Spark integration and Zeppelin notebook support inside IntelliJ IDEA
- K-Means clustering with Apache Spark and Zeppelin notebook on Docker
- Zeppelin notebook shortcuts - Mk’s Blog
- Using Apache Zeppelin with SQL Server | by Mike Moritz | Medium
- 📊Zeppelin, a data visualization platform #zeppelin #dataviz - YouTube
- 📊Getting started with Zeppelin the easy way #zeppelin #dataviz - YouTube
- 📊Connecting Zeppelin to a DB #mysql #dataviz - YouTube
- Setup Zeppelin with K8S mode on NAVER Container Cluster | by EuiYul Song | Apr, 2021 | Medium
- The problem of paragraph contents randomly disappearing when Dynamic Forms run, and a temporary workaround | by Sinjin | Feb, 2021 | Medium
- Incorporating Plotly into your Zeppelin notebooks with Spark and Scala