Pachyderm
Overview
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.
- Containers for data version control
- Uses Kubernetes as scheduler
Concepts
Pachyderm File System (PFS)
- Input data will be found in /pfs/<input_repo_name>
- Output data should always be written to /pfs/out
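A minimal sketch of this convention, runnable outside a container. PFS_ROOT is a hypothetical variable introduced here so the snippet can point at a temporary directory (inside a Pachyderm container the root would simply be /pfs), and the sample rows are made-up stand-ins for a file called sales in a repo called data.

```shell
# PFS_ROOT is a hypothetical override, not a Pachyderm setting; inside a
# pipeline container the paths would be /pfs/data and /pfs/out directly.
PFS_ROOT="${PFS_ROOT:-$(mktemp -d)}"

# Stand-in for an input repo named "data" with one file, "sales" (invented rows).
mkdir -p "$PFS_ROOT/data" "$PFS_ROOT/out"
printf 'apple\t3\norange\t1\napple\t2\n' > "$PFS_ROOT/data/sales"

# Read from <root>/<input_repo_name>, write results under <root>/out.
grep apple "$PFS_ROOT/data/sales" | awk '{print $2}' > "$PFS_ROOT/out/apple"
```

Anything written under the out directory is what Pachyderm would pick up as the pipeline's output.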
Pachyderm Processing System (PPS)
- Specified using JSON
- Creating a pipeline will run our code on every finished commit in the repo, and all future commits.
Useful
Inspect container
Of course this varies depending on how you're running the container, but in the case of using Kubernetes and Minikube, we
Tutorials
Beginner Tutorial
Creating the repo
pachctl create-repo data
pachctl list-repo
Adding data
pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set1.txt
-c starts a commit, adds the data, then finishes the commit
-f tells it where to read the data from (can be a file, URL, etc.)
pachctl list-repo
We can also view the specific commit we just made:
pachctl list-commit data
Even view the file we just added!
pachctl get-file data master sales
Creating a pipeline
See Appendix B for the JSON file specifying the pipeline.
pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/pipeline.json
pachctl list-job
pachctl list-repo
Reading the output
pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/0 apple
Processing more data
- The transformation is a map, so data is processed incrementally
- Commits have a parental structure => tracks how the data changes over time
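The incremental idea can be sketched with plain files, following the same pattern as the sum pipeline in Appendix B: the previous total is combined with only the newly added values. The function name running_total, the temporary directory layout, and the numbers are assumptions made for illustration; nothing here calls Pachyderm.

```shell
# Hypothetical helper: add the previous total (or 0 if there is none yet)
# to the values in the new file. Usage: running_total <prev_file> <new_file>
running_total() {
  { cat "$1" 2>/dev/null || echo 0; cat "$2"; } | awk '{s+=$1} END {print s}'
}

work="$(mktemp -d)"
mkdir -p "$work/prev" "$work/new"

# First commit: no previous total exists, so the sum starts from 0.
printf '3\n2\n' > "$work/new/apple"
first_total="$(running_total "$work/prev/apple" "$work/new/apple")"

# Second commit: the old output becomes "prev"; only the new values are read.
echo "$first_total" > "$work/prev/apple"
printf '4\n' > "$work/new/apple"
second_total="$(running_total "$work/prev/apple" "$work/new/apple")"

echo "first: $first_total, second: $second_total"
```

The second run never rereads the values from the first commit, which is why adding data only costs work proportional to what was added.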
So let's add some more data.
pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.3.2/doc/examples/fruit_stand/set2.txt
Which should have triggered a recomputation:
pachctl list-job
Aaaand thus our output data should also change, right?!
pachctl get-file sum f963c94e10ec46f5afadfd290e7ae199/1 apple
And it has! Great!
Exploring the file system
- Can mount the file system of the container!!
mkdir -p ~/pfs && pachctl mount ~/pfs & # background the process because it blocks
tree ~/pfs
That is just awesome!
To unmount we simply do pachctl unmount ~/pfs. If you add the flag -a it will remove all Pachyderm FUSE mounts.
Using mount you can read, write, etc. as you would with regular files, but to be able to make any changes you have to start a commit first. Then, when you're finished doing the do, you finish the commit, and all is well.
Appendix A: Definitions
- repo: highest-level primitive in PFS. Generally dedicated to a single source of data.
- commit: immutable snapshot of data.
Appendix B: Files
{
  "pipeline": { "name": "filter" },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
      "for fruit in apple orange banana; do",
      " grep $fruit /pfs/data/sales | awk '{print $2}' >>/pfs/out/$fruit",
      "done"
    ]
  },
  "parallelism_spec": { "strategy": "CONSTANT", "constant": 1 },
  "inputs": [
    { "repo": { "name": "data" }, "method": "map" }
  ]
}

{
  "pipeline": { "name": "sum" },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [
      "for fruit in apple orange banana; do",
      " { cat /pfs/prev/$fruit || echo 0; cat /pfs/filter/$fruit; } | awk '{s+=$1} END {print s}' > /pfs/out/$fruit",
      "done"
    ],
    "overwrite": true
  },
  "parallelism_spec": { "strategy": "CONSTANT", "constant": 1 },
  "inputs": [
    { "repo": { "name": "filter" }, "method": { "partition": "FILE", "incremental": "DIFF" } }
  ]
}