• 703-743-9010
  • info@oneoffcoder.com
  • 7526 Old Linton Hall Rd, Gainesville VA, 20155

Hadoop + Spark + Python Docker Container

If you want to learn Hadoop, Spark and Python (PySpark), we have published a Docker container to facilitate your learning efforts. The source code is available on GitHub and the container is published on Docker Hub. An example notebook is provided to get you jump started as well (see below).

Basic Spark testing

In [1]:
num_rdd = sc.parallelize(list(range(10000)))
num_rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
Out[1]:
333283335000

Basic Python testing

In [2]:
import random
import pandas as pd

def get_data():
    n_rows = 100
    n_cols = 10
    for r in range(n_rows):
        yield {'x{}'.format(c): random.randint(0, 100) for c in range(n_cols)}
        
df = pd.DataFrame(get_data())
df.to_csv('data.csv', index=False)

Reading a text/csv file into a Spark dataframe

In [3]:
data_rdd = sc.textFile('hdfs://localhost/data.csv')
In [4]:
data_df = spark.read.load('hdfs://localhost/data.csv', format='com.databricks.spark.csv',header='true',sep=',',inferSchema='true')
In [5]:
data_df.printSchema()
root
 |-- x0: integer (nullable = true)
 |-- x1: integer (nullable = true)
 |-- x2: integer (nullable = true)
 |-- x3: integer (nullable = true)
 |-- x4: integer (nullable = true)
 |-- x5: integer (nullable = true)
 |-- x6: integer (nullable = true)
 |-- x7: integer (nullable = true)
 |-- x8: integer (nullable = true)
 |-- x9: integer (nullable = true)

Graphframes

In [6]:
from pyspark.sql.functions import lit
from graphframes import GraphFrame

v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"]) \
.withColumn("entity", lit("person"))

e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
In [7]:
g.vertices
Out[7]:
DataFrame[id: string, name: string, age: bigint, entity: string]
In [8]:
g.edges
Out[8]:
DataFrame[src: string, dst: string, relationship: string]
In [ ]: