Pachyderm

Overview

Pachyderm is a data lake that offers complete version control for data and uses the container ecosystem to provide reproducible data processing.

  • Containers for data version control
  • Uses Kubernetes as the scheduler

Concepts

Pachyderm File System (PFS)

  • Input data will be found in /pfs/<input_repo_name>
  • Output data should always be written to /pfs/out
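
The read-from-input, write-to-out convention can be tried outside a container with a small sketch. PFS_ROOT, the words file and the line_count output are all made up for illustration; inside a real job the root would simply be /pfs.

```shell
#!/bin/sh
# Sketch of a transform that follows the PFS layout.
# PFS_ROOT is a hypothetical override so this can run locally;
# in a Pachyderm job the paths would be /pfs/data and /pfs/out.
PFS_ROOT="${PFS_ROOT:-$(mktemp -d)}"
mkdir -p "$PFS_ROOT/data" "$PFS_ROOT/out"

# Stand-in input: in a real job this file would come from the "data" repo.
printf 'hello\nworld\n' > "$PFS_ROOT/data/words"

# Read from <root>/<input_repo_name>, write the result to <root>/out.
wc -l < "$PFS_ROOT/data/words" | tr -d ' ' > "$PFS_ROOT/out/line_count"

cat "$PFS_ROOT/out/line_count"
```

Anything the transform leaves in the out directory is what would end up as the output of the job.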

Pachyderm Processing System (PPS)

  • Specified using JSON
  • Creating a pipeline runs our code on every finished commit already in the input repo, and on all future commits.

Useful

Inspect container

Of course this varies depending on how you're running the container, but in the case of using Kubernetes and Minikube, we

Tutorials

Beginner Tutorial

Creating the repo

pachctl create-repo data
pachctl list-repo

Adding data

pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set1.txt
  • -c starts a commit, adds the data, then finishes the commit
  • -f tells it where to read the data from (a file, a URL, etc.)
pachctl list-repo

We can also view the specific commit we just made:

pachctl list-commit data

We can even view the file we just added:

pachctl get-file data master sales

Creating a pipeline

See Appendix B: Files for the JSON file specifying the pipeline.

pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/pipeline.json
pachctl list-job
pachctl list-repo

Reading the output

pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/0 apple

Processing more data

  • The transformation is a map, so data is processed incrementally
  • Commits have a parent structure, which tracks how the data changes over time
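
To get a feel for what "incrementally" means here, the following is a rough local sketch of the sum transform from Appendix B, run twice against a temporary directory. The directory layout and the sample numbers are made up; the assumption is that on the second run only the newly added values are visible as input, while the previous result is available as the previous output.

```shell
#!/bin/sh
# Hypothetical local stand-ins for /pfs/prev, /pfs/filter and /pfs/out.
work="$(mktemp -d)"
mkdir -p "$work/prev" "$work/filter" "$work/out"

# Same logic as the "sum" transform: previous total (or 0) plus the new values.
sum_step() {
    { cat "$work/prev/apple" 2>/dev/null || echo 0; cat "$work/filter/apple"; } \
        | awk '{s+=$1} END {print s}' > "$work/out/apple"
}

# First commit: made-up sample values, no previous output yet.
printf '4\n6\n' > "$work/filter/apple"
sum_step
first="$(cat "$work/out/apple")"

# Second commit: only the newly added values are visible as input;
# the earlier result is carried over as the previous output.
cp "$work/out/apple" "$work/prev/apple"
printf '5\n' > "$work/filter/apple"
sum_step
second="$(cat "$work/out/apple")"

echo "after first commit:  $first"
echo "after second commit: $second"
```

The second run never looks at the values from the first commit again. It only adds the new ones to the total it already had, which is the point of processing incrementally.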

So let's add some more data.

pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set2.txt

Which should have triggered a recomputation:

pachctl list-job

Aaaand thus our output data should also change, right?!

pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/1 apple

And it has! Great!

Exploring the file system

  • Can mount PFS locally as a regular file system (via FUSE)!!
mkdir -p ~/pfs && pachctl mount ~/pfs &  # background the process because it blocks
tree ~/pfs

That is just awesome!

To unmount we simply do pachctl unmount ~/pfs. If you add the flag -a it will remove all Pachyderm FUSE mounts.

Using the mount you can read, write, etc. as you would with regular files, but to be able to make any changes you have to start a commit first. Then, when you're done, you finish the commit, and all is well.

Appendix A: Definitions

repo
highest level primitive in PFS. Generally dedicated to a single source of data.
commit
immutable snapshot of data.

Appendix B: Files

{
  "pipeline": {
    "name": "filter"
  },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
        "for fruit in apple orange banana; do",
        "   grep $fruit /pfs/data/sales | awk '{print $2}' >>/pfs/out/$fruit",
        "done"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "repo": {
        "name": "data"
      },
      "method": "map"
    }
  ]
}
{
  "pipeline": {
    "name": "sum"
  },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
        "for fruit in apple orange banana; do",
        "   { cat /pfs/prev/$fruit || echo 0; cat /pfs/filter/$fruit; } | awk '{s+=$1} END {print s}' > /pfs/out/$fruit",
        "done"
    ],
    "overwrite": true
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "repo": {
        "name": "filter"
      },
      "method": {
        "partition": "FILE",
        "incremental": "DIFF"
      }
    }
  ]
}
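
To see what the filter transform above computes without a cluster, its loop can be pointed at a temporary directory. This is only a sketch: the directory stands in for /pfs/data and /pfs/out, and the sample input is made up in a "<fruit> <amount>" layout that may not match the real sales file.

```shell
#!/bin/sh
# Hypothetical local stand-ins for /pfs/data and /pfs/out.
work="$(mktemp -d)"
mkdir -p "$work/data" "$work/out"

# Made-up sample input; the real sales file may be laid out differently.
cat > "$work/data/sales" <<'EOF'
apple 4
orange 2
banana 7
apple 6
EOF

# Same loop as the "filter" transform, pointed at the stand-in directories:
# collect the second column of every line mentioning the fruit.
for fruit in apple orange banana; do
    grep "$fruit" "$work/data/sales" | awk '{print $2}' >> "$work/out/$fruit"
done

cat "$work/out/apple"
```

Each fruit ends up with its own file of amounts in the output directory, which is the input the sum pipeline then adds up.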