Pachyderm

Overview
Concepts
- Pachyderm File System (PFS)
- Pachyderm Processing System (PPS)
Useful
- Inspect container
Tutorials
- Beginner Tutorial
Appendix A: Definitions
Appendix B: Files

Overview

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.

Containers for data version control
Uses Kubernetes as the scheduler

Concepts

Pachyderm File System (PFS)

Input data will be found in /pfs/<input_repo_name>
Output data should always be written to /pfs/out
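
The read-from-input, write-to-out convention can be tried without a cluster. This is only a sketch: a temporary directory (the hypothetical PFS_ROOT variable) stands in for the real /pfs, and the sales records are made up for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for /pfs so the sketch runs on any machine.
PFS_ROOT="$(mktemp -d)"
mkdir -p "$PFS_ROOT/data" "$PFS_ROOT/out"

# Made-up input file in an input repo called "data".
printf 'apple 1\norange 4\napple 3\n' > "$PFS_ROOT/data/sales"

# Read from the input repo's directory, write the result under out/.
awk 'END {print NR}' "$PFS_ROOT/data/sales" > "$PFS_ROOT/out/line_count"

COUNT="$(cat "$PFS_ROOT/out/line_count")"
echo "records: $COUNT"
```

Inside a real pipeline container the same code would use /pfs/data and /pfs/out directly.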

Pachyderm Processing System (PPS)

Pipelines are specified using JSON.
Creating a pipeline runs our code on every finished commit already in the input repo, and on every future commit.

Useful

Inspect container

Of course this varies depending on how you're running the container, but in the case of using Kubernetes and Minikube, we

Tutorials

Beginner Tutorial

Reference.

Creating the repo

pachctl create-repo data

pachctl list-repo

Adding data

pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set1.txt

-c starts a commit, adds the data, then finishes the commit
-f tells it where to read the data from (can be a file, URL, etc.)

pachctl list-repo

We can also view the specific commit we just made:

pachctl list-commit data

We can even view the file we just added!

pachctl get-file data master sales

Creating a pipeline

The JSON file specifying the pipeline is reproduced in Appendix B.

pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/pipeline.json

pachctl list-job

pachctl list-repo

Reading the output

pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/0 apple

Processing more data

The transformation is a map, so data is processed incrementally.
Commits have a parent structure, which tracks how the data changes over time.
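
The incremental idea can be sketched locally: the previous total is carried forward and only the newly arrived values are added to it. This mirrors the shell snippet in the sum pipeline from Appendix B, but it is only an illustration. The temporary directory stands in for /pfs, the numbers are made up, and sum_step is a hypothetical helper, not part of Pachyderm.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for /pfs.
WORK="$(mktemp -d)"
mkdir -p "$WORK/prev" "$WORK/new" "$WORK/out"

# sum_step: add the new values to the previous total (0 if there is none yet).
sum_step() {
  { cat "$WORK/prev/apple" 2>/dev/null || echo 0; cat "$WORK/new/apple"; } \
    | awk '{s+=$1} END {print s}' > "$WORK/out/apple"
}

# First "commit": two made-up apple sales, no previous output.
printf '1\n3\n' > "$WORK/new/apple"
sum_step
FIRST="$(cat "$WORK/out/apple")"

# Second "commit": only the new value is read; the old total comes from prev.
cp "$WORK/out/apple" "$WORK/prev/apple"
printf '5\n' > "$WORK/new/apple"
sum_step
SECOND="$(cat "$WORK/out/apple")"

echo "first=$FIRST second=$SECOND"
```

The second step never rereads the first batch of values, which is the point of processing only what changed.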

So let's add some more data.

pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set2.txt

This should have triggered a recomputation:

pachctl list-job

Aaaand thus our output data should also change, right?!

pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/1 apple

And it has! Great!

Exploring the file system

We can even mount the file system locally!

mkdir -p ~/pfs && pachctl mount ~/pfs &  # background the process because it blocks

tree ~/pfs

That is just awesome!

To unmount, we simply run pachctl unmount ~/pfs. Adding the -a flag removes all Pachyderm FUSE mounts.

With the mount in place you can read and write as you would with regular files, but to make any changes you have to start a commit first. When you're done, you finish the commit, and all is well.

Appendix A: Definitions

repo: the highest-level primitive in PFS. Generally dedicated to a single source of data.
commit: immutable snapshot of data.

Appendix B: Files

{
  "pipeline": {
    "name": "filter"
  },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
        "for fruit in apple orange banana; do",
        "   grep $fruit /pfs/data/sales | awk '{print $2}' >>/pfs/out/$fruit",
        "done"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "repo": {
        "name": "data"
      },
      "method": "map"
    }
  ]
}
{
  "pipeline": {
    "name": "sum"
  },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
        "for fruit in apple orange banana; do",
        "   { cat /pfs/prev/$fruit || echo 0; cat /pfs/filter/$fruit; } | awk '{s+=$1} END {print s}' > /pfs/out/$fruit",
        "done"
    ],
    "overwrite": true
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "repo": {
        "name": "filter"
      },
      "method": {
        "partition": "FILE",
        "incremental": "DIFF"
      }
    }
  ]
}
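
To get a feel for what the filter pipeline's transform does, its stdin lines can be fed to sh by hand. This is a rough local sketch, not how Pachyderm itself runs the job: the hypothetical PFS variable points at a temporary directory instead of /pfs, and the sales records are made up.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for /pfs, exported so the inner shell can see it.
PFS="$(mktemp -d)"
export PFS
mkdir -p "$PFS/data" "$PFS/out"

# Made-up sales records: "<fruit> <count>" per line.
printf 'apple 1\nbanana 2\napple 3\norange 4\n' > "$PFS/data/sales"

# Send the script to sh on standard input, as the "stdin" array does.
# The heredoc is quoted so $fruit and $2 reach the inner shell untouched.
sh <<'EOF'
for fruit in apple orange banana; do
   grep $fruit "$PFS/data/sales" | awk '{print $2}' >>"$PFS/out/$fruit"
done
EOF

APPLE="$(cat "$PFS/out/apple")"
ORANGE="$(cat "$PFS/out/orange")"
echo "apple: $(echo $APPLE), orange: $ORANGE"
```

Each fruit ends up with its own file of counts under out/, which is the shape the sum pipeline then reads from /pfs/filter.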