diff --git a/docs/index.rst b/docs/index.rst index e141c791f..9b80ace4f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,13 +1,15 @@ .. title:: Home .. toctree:: :maxdepth: 3 :includehidden: :name: Documentation - Install - Learn - API Docs - Reference + Install + Intro Tutorial + Learn + Demos + API Docs + Reference Deploying - Community + Community diff --git a/docs/sections/api/api.rst b/docs/sections/api/index.rst similarity index 100% rename from docs/sections/api/api.rst rename to docs/sections/api/index.rst diff --git a/docs/sections/community/community.rst b/docs/sections/community/index.rst similarity index 100% rename from docs/sections/community/community.rst rename to docs/sections/community/index.rst diff --git a/docs/sections/learn/airline_demo.rst b/docs/sections/demos/airline_demo.rst similarity index 100% rename from docs/sections/learn/airline_demo.rst rename to docs/sections/demos/airline_demo.rst diff --git a/docs/sections/demos/index.rst b/docs/sections/demos/index.rst new file mode 100644 index 000000000..5d4e2cfeb --- /dev/null +++ b/docs/sections/demos/index.rst @@ -0,0 +1,7 @@ +Demos +======================= + +.. toctree:: + :maxdepth: 1 + + airline_demo \ No newline at end of file diff --git a/docs/sections/install/install.rst b/docs/sections/install/index.rst similarity index 100% rename from docs/sections/install/install.rst rename to docs/sections/install/index.rst diff --git a/docs/sections/learn/tutorial/ML_pipeline.png b/docs/sections/intro_tutorial/ML_pipeline.png similarity index 100% rename from docs/sections/learn/tutorial/ML_pipeline.png rename to docs/sections/intro_tutorial/ML_pipeline.png diff --git a/docs/sections/learn/tutorial/RF_ipynb.png b/docs/sections/intro_tutorial/RF_ipynb.png similarity index 100% rename from docs/sections/learn/tutorial/RF_ipynb.png rename to docs/sections/intro_tutorial/RF_ipynb.png diff --git a/docs/sections/learn/tutorial/RF_output_notebook.png b/docs/sections/intro_tutorial/RF_output_notebook.png similarity index 100% rename from docs/sections/learn/tutorial/RF_output_notebook.png rename to docs/sections/intro_tutorial/RF_output_notebook.png diff --git a/docs/sections/learn/tutorial/clean_data_ipynb.png b/docs/sections/intro_tutorial/clean_data_ipynb.png similarity index 100% rename from docs/sections/learn/tutorial/clean_data_ipynb.png rename to docs/sections/intro_tutorial/clean_data_ipynb.png diff --git a/docs/sections/learn/tutorial/complex_pipeline_figure_one.png b/docs/sections/intro_tutorial/complex_pipeline_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/complex_pipeline_figure_one.png rename to docs/sections/intro_tutorial/complex_pipeline_figure_one.png diff --git a/docs/sections/learn/tutorial/composite_solids.png b/docs/sections/intro_tutorial/composite_solids.png similarity index 100% rename from docs/sections/learn/tutorial/composite_solids.png rename to docs/sections/intro_tutorial/composite_solids.png diff --git a/docs/sections/learn/tutorial/composite_solids.rst b/docs/sections/intro_tutorial/composite_solids.rst similarity index 91% rename from docs/sections/learn/tutorial/composite_solids.rst rename to docs/sections/intro_tutorial/composite_solids.rst index 59dc609b5..c8ba0609f 100644 --- a/docs/sections/learn/tutorial/composite_solids.rst +++ b/docs/sections/intro_tutorial/composite_solids.rst @@ -1,50 +1,50 @@ Composing solids ---------------- Abstracting business logic into reusable, configurable solids is one important step towards making data applications 
like other software applications. The other basic facility that we expect from software in other domains is composability -- the ability to combine building blocks into larger functional units. Composite solids can be used to organize and refactor large or complicated pipelines, abstracting away complexity, as well as to wrap reusable general-purpose solids together with domain-specific logic. As an example, let's compose two instances of a complex, general-purpose ``read_csv`` solid along with some domain-specific logic for the specific purpose of joining our cereal dataset with a lookup table providing human-readable names for the cereal manufacturers. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/composite_solids.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/composite_solids.py :linenos: :lines: 126-130 :lineno-start: 126 :caption: composite_solids.py Defining a composite solid is similar to defining a pipeline, except that we use the :py:func:`@composite_solid ` decorator instead of :py:func:`@pipeline `. Dagit has sophisticated facilities for visualizing composite solids: .. thumbnail:: composite_solids.png All of the complexity of the composite solid is hidden by default, but we can expand it at will by clicking into the solid (or on the "Expand" button in the right-hand pane): .. thumbnail:: composite_solids_expanded.png Note the line indicating that the output of ``join_cereal`` is returned as the output of the composite solid as a whole. Config for the individual solids making up the composite is nested, as follows: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/composite_solids.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/composite_solids.yaml :linenos: :language: YAML :caption: composite_solids.yaml :emphasize-lines: 1-3 When we execute this pipeline, Dagit includes information about the nesting of individual execution steps within the composite: .. thumbnail:: composite_solids_expanded.png diff --git a/docs/sections/learn/tutorial/composite_solids_execution.png b/docs/sections/intro_tutorial/composite_solids_execution.png similarity index 100% rename from docs/sections/learn/tutorial/composite_solids_execution.png rename to docs/sections/intro_tutorial/composite_solids_execution.png diff --git a/docs/sections/learn/tutorial/composite_solids_expanded.png b/docs/sections/intro_tutorial/composite_solids_expanded.png similarity index 100% rename from docs/sections/learn/tutorial/composite_solids_expanded.png rename to docs/sections/intro_tutorial/composite_solids_expanded.png diff --git a/docs/sections/learn/tutorial/conditional_outputs.png b/docs/sections/intro_tutorial/conditional_outputs.png similarity index 100% rename from docs/sections/learn/tutorial/conditional_outputs.png rename to docs/sections/intro_tutorial/conditional_outputs.png diff --git a/docs/sections/learn/tutorial/config.rst b/docs/sections/intro_tutorial/config.rst similarity index 91% rename from docs/sections/learn/tutorial/config.rst rename to docs/sections/intro_tutorial/config.rst index 95ae94328..073450b1d 100644 --- a/docs/sections/learn/tutorial/config.rst +++ b/docs/sections/intro_tutorial/config.rst @@ -1,91 +1,91 @@ Parametrizing solids with config -------------------------------- Solids often depend in predictable ways on features of the external world or the pipeline in which they're invoked. 
For example, consider an extended version of our csv-reading solid that implements more of the options available in the underlying Python API: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/config_bad_1.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/config_bad_1.py :lines: 6-26 :linenos: :lineno-start: 6 :emphasize-lines: 8-14 :caption: config_bad_1.py We obviously don't want to have to write a separate solid for each permutation of these parameters that we use in our pipelines -- especially because, in more realistic cases like configuring a Spark job or even parametrizing the ``read_csv`` function from a popular package like `Pandas `__, we might have dozens or hundreds of parameters like these. But hoisting all of these parameters into the signature of the solid function as inputs isn't the right answer either: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/config_bad_2.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/config_bad_2.py :lines: 6-36 :linenos: :lineno-start: 6 :emphasize-lines: 5-11 :caption: config_bad_2.py Defaults are often sufficient for configuration values like these, and sets of parameters are often reusable. And it's unlikely that values like this will be provided dynamically by the outputs of other solids in a pipeline. Inputs, on the other hand, will usually be provided by the outputs of other solids in a pipeline, even though we might sometimes want to stub them using the config facility. For all these reasons, it's bad practice to mix configuration values like these with true input values. The solution is to define a config schema for our solid: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/config.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/config.py :linenos: :lines: 1-102 :emphasize-lines: 15, 33-41, 87 :caption: config.py First, we pass the ``config`` argument to the :py:func:`@solid ` decorator. This tells Dagster to give our solid a config field structured as a dictionary, whose keys are the keys of this argument, and the types of whose values are defined by the values of this argument (instances of :py:func:`Field `). Then, we define one of these fields, ``escapechar``, to be a string, setting a default value, making it optional, and setting a human-readable description. Finally, inside the body of the solid function, we access the config value set by the user using the ``solid_config`` field on the familiar :py:class:`context ` object. When Dagster executes our pipeline, the framework will make validated config for each solid available on this object. Let's see how all of this looks in dagit. As usual, run: .. code-block:: console $ dagit -f config.py -n config_pipeline .. thumbnail:: config_figure_one.png As you may by now expect, Dagit provides a fully type-aware and schema-aware config editing environment with a typeahead. The human-readable descriptions we provided on our config fields appear in the config context minimap, as well as in typeahead tooltips and in the Explore pane when clicking into the individual solid definition. .. thumbnail:: config_figure_two.png You can see that we've added a new section to the solid config. In addition to the ``inputs`` section, which we'll still use to set the ``csv_path`` input, we now have a ``config`` section, where we can set values defined in the ``config`` argument to :py:func:`@solid `. -..
literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/config_env_bad.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/config_env_bad.yaml :linenos: :language: YAML :caption: config_env_bad.yaml Of course, this config won't give us the results we're expecting. The values in ``cereal.csv`` are comma-separated, not semicolon-separated, as they might be if this were a .csv from Europe, where commas are frequently used in place of the decimal point. We'll see later how we can use Dagster's facilities for automatic data quality checks to guard against semantic issues like this, which won't be caught by the type system. diff --git a/docs/sections/learn/tutorial/config_figure_one.png b/docs/sections/intro_tutorial/config_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/config_figure_one.png rename to docs/sections/intro_tutorial/config_figure_one.png diff --git a/docs/sections/learn/tutorial/config_figure_two.png b/docs/sections/intro_tutorial/config_figure_two.png similarity index 100% rename from docs/sections/learn/tutorial/config_figure_two.png rename to docs/sections/intro_tutorial/config_figure_two.png diff --git a/docs/sections/learn/tutorial/custom_types_figure_one.png b/docs/sections/intro_tutorial/custom_types_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/custom_types_figure_one.png rename to docs/sections/intro_tutorial/custom_types_figure_one.png diff --git a/docs/sections/learn/tutorial/custom_types_figure_two.png b/docs/sections/intro_tutorial/custom_types_figure_two.png similarity index 100% rename from docs/sections/learn/tutorial/custom_types_figure_two.png rename to docs/sections/intro_tutorial/custom_types_figure_two.png diff --git a/docs/sections/learn/tutorial/hello_cereal.rst b/docs/sections/intro_tutorial/hello_cereal.rst similarity index 94% rename from docs/sections/learn/tutorial/hello_cereal.rst rename to docs/sections/intro_tutorial/hello_cereal.rst index 06682586d..6636e9432 100644 --- a/docs/sections/learn/tutorial/hello_cereal.rst +++ b/docs/sections/intro_tutorial/hello_cereal.rst @@ -1,204 +1,204 @@ .. py:currentmodule:: dagster Hello, cereal! --------------- In this tutorial, we'll explore the feature set of Dagster with small examples that are intended to be illustrative of real data problems. We'll build these examples around a simple but scary .csv dataset, ``cereal.csv``, which contains nutritional facts about 80 breakfast cereals. You can find this dataset on `Github `_. Or, if you've cloned the dagster git repository, you'll find this dataset at ``dagster/examples/dagster_examples/intro_tutorial/cereal.csv``. To get the flavor of this dataset, let's look at the header and the first five rows: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/cereal.csv +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/cereal.csv :linenos: :lines: 1-6 :caption: cereals.csv :language: text Hello, solid! ^^^^^^^^^^^^^ Let's write our first Dagster solid and save it as ``hello_cereal.py``. (You can also find this file, and all of the tutorial code, on `Github `__ or, if you've cloned the git repo, at ``dagster/examples/dagster_examples/intro_tutorial/``.) A solid is a unit of computation in a data pipeline. Typically, you'll define solids by annotating ordinary Python functions with the :py:func:`@solid ` decorator. 
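For orientation, here is a minimal sketch of what such a decorated function can look like (the solid name and log message below are illustrative, not part of the tutorial code):

.. code-block:: python

    from dagster import solid


    @solid
    def my_solid(context):
        # every solid receives the framework-provided context as its first argument
        context.log.info('Hello from a minimal solid.')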
The logic in our first solid is very straightforward: it just reads in the csv from a hardcoded path and logs the number of rows it finds. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/hello_cereal.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/hello_cereal.py :linenos: :lines: 1-18 :caption: hello_cereal.py In this simplest case, our solid takes no inputs except for the :py:class:`context ` in which it executes (provided by the Dagster framework as the first argument to every solid), and also returns no outputs. Don't worry, we'll soon encounter solids that are much more dynamic. Hello, pipeline! ^^^^^^^^^^^^^^^^ To execute our solid, we'll embed it in an equally simple pipeline. A pipeline is a set of solids arranged into a DAG (or `directed acyclic graph `_) of computation. You'll typically define pipelines by annotating ordinary Python functions with the :py:func:`@pipeline ` decorator. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/hello_cereal.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/hello_cereal.py :linenos: :lineno-start: 21 :lines: 21-23 :caption: hello_cereal.py Here you'll see that we call ``hello_cereal()``. This call doesn't actually execute the solid -- within the body of functions decorated with :py:func:`@pipeline `, we use function calls to indicate the dependency structure of the solids making up the pipeline. Here, we indicate that the execution of ``hello_cereal`` doesn't depend on any other solids by calling it with no arguments. .. _executing-our-first-pipeline: Executing our first pipeline ---------------------------- Assuming you've saved this pipeline as ``hello_cereal.py``, we can execute it via any of three different mechanisms: 1. From the command line, using the ``dagster`` CLI. 2. From a rich graphical interface, using the ``dagit`` GUI tool. 3. From arbitrary Python scripts, using dagster's Python API. Using the dagster CLI to execute a pipeline ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From the directory in which you've saved the pipeline file, just run: .. code-block:: console $ dagster pipeline execute -f hello_cereal.py -n hello_cereal_pipeline You'll see the full stream of events emitted by dagster appear in the console, including our call to the logging machinery, which will look like: .. code-block:: console :emphasize-lines: 1 2019-10-10 11:46:50 - dagster - INFO - system - a91a4cc4-d218-4c2b-800c-aac50fced1a5 - Found 77 cereals solid = "hello_cereal" solid_definition = "hello_cereal" step_key = "hello_cereal.compute" Success! Using dagit to execute a pipeline ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To visualize your pipeline (which only has one node) in dagit, from the directory in which you've saved the pipeline file, just run: .. code-block:: console $ dagit -f hello_cereal.py -n hello_cereal_pipeline You'll see output like: .. code-block:: console Loading repository... Serving on http://127.0.0.1:3000 You should be able to navigate to http://127.0.0.1:3000/p/hello_cereal_pipeline/explore in your web browser and view your pipeline. It isn't very interesting yet, because it only has one node. .. thumbnail:: hello_cereal_figure_one.png Click on the "Execute" tab (http://127.0.0.1:3000/p/hello_cereal_pipeline/execute) and you'll see the two-paned view below. .. thumbnail:: hello_cereal_figure_two.png The left hand pane is empty here, but in more complicated pipelines, this is where you'll be able to edit pipeline configuration on the fly.
The right hand pane shows the concrete execution plan corresponding to the logical structure of the pipeline -- which also only has one node, ``hello_cereal.compute``. Click the "Start Execution" button to execute this plan directly from dagit. A new window should open, and you'll see a much more structured view of the stream of Dagster events start to appear in the left-hand pane. (If you have pop-up blocking enabled, you may need to tell your browser to allow pop-ups from 127.0.0.1 -- or, just navigate to the "Runs" tab to see this, and every run of your pipeline.) .. thumbnail:: hello_cereal_figure_three.png In this view, you can filter and search through the logs corresponding to your pipeline run. Using the Python API to execute a pipeline ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you'd rather execute your pipelines as a script, you can do that without using the dagster CLI at all. Just add a few lines to ``hello_cereal.py``: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/hello_cereal.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/hello_cereal.py :linenos: :lineno-start: 26 :lines: 26-28 :caption: hello_cereal.py Now you can just run: .. code-block:: console $ python hello_cereal.py The :py:func:`execute_pipeline` function called here is the core Python API for executing Dagster pipelines from code. Testing solids and pipelines ---------------------------- Our first solid and pipeline wouldn't be complete without some tests to ensure they're working as expected. We'll use :py:func:`execute_pipeline` to test our pipeline, as well as :py:func:`execute_solid` to test our solid in isolation. These functions synchronously execute a pipeline or solid and return results objects (the :py:class:`SolidExecutionResult` and :py:class:`PipelineExecutionResult`) whose methods let us investigate, in detail, the success or failure of execution, the outputs produced by solids, and (as we'll see later) other events associated with execution. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/hello_cereal.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/hello_cereal.py :linenos: :caption: hello_cereal.py :lineno-start: 31 :lines: 31-40 Now you can use pytest, or your test runner of choice, to run unit tests as you develop your data applications. .. code-block:: console $ pytest hello_cereal.py Note: pytest tests are typically in files prefixed with `test_`. However in order to simplify the tutorial we have them in the same file. Obviously, in production we'll often execute pipelines in a parallel, streaming way that doesn't admit this kind of API, which is intended to enable local tests like this. Dagster is written to make testing easy in a domain where it has historically been very difficult. Throughout the rest of this tutorial, we'll explore the writing of unit tests for each piece of the framework as we learn about it. 
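As a rough sketch of the testing pattern described above (the test function names here are our own, and we assume the solid and pipeline are importable from ``hello_cereal.py``):

.. code-block:: python

    from dagster import execute_pipeline, execute_solid

    from hello_cereal import hello_cereal, hello_cereal_pipeline


    def test_hello_cereal_solid():
        # execute_solid runs a single solid and returns a SolidExecutionResult
        result = execute_solid(hello_cereal)
        assert result.success


    def test_hello_cereal_pipeline():
        # execute_pipeline runs the whole pipeline and returns a PipelineExecutionResult
        result = execute_pipeline(hello_cereal_pipeline)
        assert result.success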
diff --git a/docs/sections/learn/tutorial/hello_cereal_figure_one.png b/docs/sections/intro_tutorial/hello_cereal_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/hello_cereal_figure_one.png rename to docs/sections/intro_tutorial/hello_cereal_figure_one.png diff --git a/docs/sections/learn/tutorial/hello_cereal_figure_three.png b/docs/sections/intro_tutorial/hello_cereal_figure_three.png similarity index 100% rename from docs/sections/learn/tutorial/hello_cereal_figure_three.png rename to docs/sections/intro_tutorial/hello_cereal_figure_three.png diff --git a/docs/sections/learn/tutorial/hello_cereal_figure_two.png b/docs/sections/intro_tutorial/hello_cereal_figure_two.png similarity index 100% rename from docs/sections/learn/tutorial/hello_cereal_figure_two.png rename to docs/sections/intro_tutorial/hello_cereal_figure_two.png diff --git a/docs/sections/learn/tutorial/hello_dag.rst b/docs/sections/intro_tutorial/hello_dag.rst similarity index 92% rename from docs/sections/learn/tutorial/hello_dag.rst rename to docs/sections/intro_tutorial/hello_dag.rst index 35a49751c..1453b522d 100644 --- a/docs/sections/learn/tutorial/hello_dag.rst +++ b/docs/sections/intro_tutorial/hello_dag.rst @@ -1,88 +1,88 @@ .. py:currentmodule:: dagster Connecting solids together -------------------------- Our pipelines wouldn't be very interesting if they were limited to solids acting in isolation from each other. Pipelines are useful because they let us connect solids into arbitrary DAGs (`directed acyclic graphs `_) of computation. Let's get serial ^^^^^^^^^^^^^^^^ We'll add a second solid to the pipeline we worked with in the first section of the tutorial. This new solid will consume the output of the first solid, which read the cereal dataset in from disk, and in turn will sort the list of cereals by their calorie content per serving. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/serial_pipeline.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/serial_pipeline.py :linenos: :lines: 1-37 :emphasize-lines: 15, 19, 37 :caption: serial_pipeline.py You'll see that we've modified our existing ``load_cereals`` solid to return an output, in this case the list of dicts into which :py:class:``csv.DictReader `` reads the cereals dataset. We've defined our new solid, ``sort_by_calories``, to take a user-defined input, ``cereals``, in addition to the system-provided :py:class:`context ` object. We can use inputs and outputs to connect solids to each other. Here we tell Dagster that although ``load_cereals`` doesn't depend on the output of any other solid, ``sort_by_calories`` does -- it depends on the output of ``load_cereals``. Let's visualize the DAG we've just defined in dagit. .. code-block:: console $ dagit -f serial_pipeline.py -n serial_pipeline Navigate to http://127.0.0.1:3000/p/serial_pipeline/explore or choose "serial_pipeline" from the dropdown: .. thumbnail:: serial_pipeline_figure_one.png A more complex DAG ^^^^^^^^^^^^^^^^^^ Solids don't need to be wired together serially. The output of one solid can be consumed by any number of other solids, and the outputs of several different solids can be consumed by a single solid. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/complex_pipeline.py +.. 
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/complex_pipeline.py :linenos: :lines: 6-64 :lineno-start: 6 :emphasize-lines: 55-59 :caption: complex_pipeline.py First we introduce the intermediate variable ``cereals`` into our pipeline definition to represent the output of the ``load_cereals`` solid. Then we make both ``sort_by_calories`` and ``sort_by_protein`` consume this output. Their outputs are in turn both consumed by ``display_results``. Let's visualize this pipeline in Dagit (``dagit -f complex_pipeline.py -n complex_pipeline``): .. thumbnail:: complex_pipeline_figure_one.png When you execute this example from Dagit, you'll see that ``load_cereals`` executes first, followed by ``sort_by_calories`` and ``sort_by_protein`` -- in any order -- and that ``display_results`` executes last, only after ``sort_by_calories`` and ``sort_by_protein`` have both executed. In more sophisticated execution environments, ``sort_by_calories`` and ``sort_by_protein`` could execute not just in any order, but at the same time, since they don't depend on each other's outputs -- but both would still have to execute after ``load_cereals`` (because they depend on its output) and before ``display_results`` (because ``display_results`` depends on both of their outputs). We'll write a simple test for this pipeline showing how we can assert that all four of its solids executed successfully. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/complex_pipeline.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/complex_pipeline.py :linenos: :lines: 72-77 :lineno-start: 72 :caption: complex_pipeline.py diff --git a/docs/sections/learn/tutorial/index.rst b/docs/sections/intro_tutorial/index.rst similarity index 51% rename from docs/sections/learn/tutorial/index.rst rename to docs/sections/intro_tutorial/index.rst index fdfc20aa8..f73ad9727 100644 --- a/docs/sections/learn/tutorial/index.rst +++ b/docs/sections/intro_tutorial/index.rst @@ -1,23 +1,29 @@ -Tutorial +Intro Tutorial ======================= +If you're new to Dagster, we recommend going through this tutorial to become familiar with building +Dagster applications through a gentle introduction to Dagster's feature set and tooling. + +In this tutorial, we'll walk through creating a Dagster application that processes a csv dataset containing +nutritional facts about cereals. + + .. toctree:: :maxdepth: 1 hello_cereal hello_dag inputs config types multiple_outputs composite_solids materializations intermediates resources repos scheduler intro_airflow You can find all of the tutorial code checked into the dagster repository at ``dagster/examples/dagster_examples/intro_tutorial``. diff --git a/docs/sections/learn/tutorial/inputs.rst b/docs/sections/intro_tutorial/inputs.rst similarity index 91% rename from docs/sections/learn/tutorial/inputs.rst rename to docs/sections/intro_tutorial/inputs.rst index 50eca7ebd..a68d03dc1 100644 --- a/docs/sections/learn/tutorial/inputs.rst +++ b/docs/sections/intro_tutorial/inputs.rst @@ -1,239 +1,239 @@ .. py:currentmodule:: dagster Parametrizing solids with inputs -------------------------------- So far, we've only seen solids whose behavior is the same every time they're run: -..
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/serial_pipeline.py :lines: 6-15 :linenos: :lineno-start: 6 :caption: serial_pipeline.py In general, though, rather than relying on hardcoded values like ``dataset_path``, we'd like to be able to parametrize our solid logic. Appropriately parameterized solids are more testable, and also more reusable. Consider the following more generic solid: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs.py :lines: 6-12 :linenos: :lineno-start: 6 :caption: inputs.py Here, rather than hardcoding the value of ``dataset_path``, we use an input, ``csv_path``. It's easy to see why this is better. We can reuse the same solid in all the different places we might need to read in a .csv from a filepath. We can test the solid by pointing it at some known test csv file. And we can use the output of another upstream solid to determine which file to load. Let's rebuild a pipeline we've seen before, but this time using our newly parameterized solid. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs.py :lines: 1-36 :linenos: :emphasize-lines: 36 :caption: inputs.py As you can see above, what's missing from this setup is a way to specify the ``csv_path`` input to our new ``read_csv`` solid in the absence of any upstream solids whose outputs we can rely on. Dagster provides the ability to stub inputs to solids that aren't satisfied by the pipeline topology as part of its flexible configuration facility. We can specify config for a pipeline execution regardless of which modality we use to execute the pipeline -- the Python API, the Dagit GUI, or the command line. Specifying config in the Python API ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We previously encountered the :py:func:`execute_pipeline` function. Pipeline configuration is specified by the second argument to this function, which must be a dict (the "environment dict"). This dict contains all of the user-provided configuration with which to execute a pipeline. As such, it can have :ref:`a lot of sections `, but we'll only use one of them here: per-solid configuration, which is specified under the key ``solids``: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs.py :linenos: :lineno-start: 40 :lines: 40-44 :dedent: 4 :caption: inputs.py The ``solids`` dict is keyed by solid name, and each solid is configured by a dict that may itself have several sections. In this case we are only interested in the ``inputs`` section, so that we can specify the value of the input ``csv_path``. Now you can pass this environment dict to :py:func:`execute_pipeline`: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs.py :linenos: :lines: 45-47 :dedent: 4 :lineno-start: 45 :caption: inputs.py Specifying config using YAML fragments and the dagster CLI ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When executing pipelines with the dagster CLI, we'll need to provide the environment dict in a config file. We use YAML for the file-based representation of an environment dict, but the values are the same as before: -.. 
literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml :language: YAML :linenos: :caption: inputs_env.yaml We can pass config files in this format to the dagster CLI tool with the ``-e`` flag. .. code-block:: console $ dagster pipeline execute -f inputs.py -n inputs_pipeline -e inputs_env.yaml In practice, you might have different sections of your environment dict in different yaml files -- if, for instance, some sections change more often (e.g. in test and prod) while others are more static. In this case, you can set multiple instances of the ``-e`` flag on CLI invocations, and the CLI tools will assemble the YAML fragments into a single environment dict. Using the Dagit config editor ----------------------------- Dagit provides a powerful, schema-aware, typeahead-enabled config editor to enable rapid experimentation with and debugging of parameterized pipeline executions. As always, run: .. code-block:: console $ dagit -f inputs.py -n inputs_pipeline Notice the error in the right hand pane of the **Execute** tab. .. thumbnail:: inputs_figure_one.png Because Dagit is schema-aware, it knows that this pipeline now requires configuration in order to run without errors. In this case, since the pipeline is relatively trivial, it wouldn't be especially costly to run the pipeline and watch it fail. But when pipelines are complex and slow, it's invaluable to get this kind of feedback up front rather than have an unexpected failure deep inside a pipeline. Recall that the execution plan, which we would ordinarily see in the right-hand pane of the **Execute** tab, is the concrete pipeline that Dagster will actually execute. Without a valid config, Dagster can't construct a parametrization of the logical pipeline -- so no execution plan is available for us to preview. Press `CTRL-Space` in order to bring up the typeahead assistant. .. thumbnail:: inputs_figure_two.png Here you can see all of the sections available in the environment dict. Don't worry, we'll get to them all later. Let's enter the config we need in order to execute our pipeline. .. thumbnail:: inputs_figure_three.png Note that as you type and edit the config, the config minimap hovering on the right side of the editor pane changes to provide context -- so you always know where in the nested config schema you are while making changes. Writing tests that supply inputs to solids ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You'll frequently want to provide test inputs to solids. You can use :py:func:`execute_pipeline` and the environment dict to do this, but you can also pass input values directly using the :py:func:`execute_solid` API. This can be especially useful when it is cumbersome or impossible to parametrize an input through the environment dict. For example, we may want to test ``sort_by_calories`` on a controlled data set where we know the most and least caloric cereals in advance, but without having to flow its input from an upstream solid implementing a data ingest process. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/test_inputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/test_inputs.py :lines: 9-22 :lineno-start: 9 :linenos: :caption: test_inputs.py When we execute this test (e.g., using pytest), we'll be reminded again of one of the reasons why it's always a good idea to write unit tests, even for the most seemingly trivial components. ..
code-block:: shell :emphasize-lines: 17 ================================ FAILURES ================================= __________________________ test_sort_by_calories __________________________ def test_sort_by_calories(): res = execute_solid( sort_by_calories, input_values={ 'cereals': [ {'name': 'just_lard', 'calories': '1100'}, {'name': 'dry_crust', 'calories': '20'} ] } ) assert res.success output_value = res.output_value() > assert output_value['most_caloric'] == 'just_lard' E AssertionError: assert {'calories': '20', 'name': 'dry_crust'} == 'just_lard' test_inputs.py:18: AssertionError It looks as though we've forgotten to coerce our calorie counts to integers before sorting by them. (Alternatively, we could modify our ``load_cereals`` logic to extend the basic functionality provided by :py:class:`python:csv.DictReader` and add a facility to specify column-wise datatype conversion.) Type-checking inputs -------------------- Note that this section requires Python 3. If you zoom in on the **Explore** tab in Dagit and click on one of our pipeline solids, you'll see that its inputs and outputs are annotated with types. .. thumbnail:: inputs_figure_four.png By default, every untyped value in Dagster is assigned the catch-all type :py:class:`Any`. This means that any errors in the config won't be surfaced until the pipeline is executed. For example, when we execute our pipeline with this config, it'll fail at runtime: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_env_bad.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_env_bad.yaml :language: YAML :linenos: :caption: inputs_env_bad.yaml When we enter this mistyped config in Dagit and execute our pipeline, you'll see that an error appears in the structured log viewer pane of the **Execute** tab: .. thumbnail:: inputs_figure_five.png Click on "View Full Message" or on the red dot on the execution step that failed and a detailed stacktrace will pop up. .. thumbnail:: inputs_figure_six.png It would be better if we could catch this error earlier, when we specify the config. So let's make the inputs typed. A user can apply types to inputs and outputs using Python 3's type annotation syntax. In this case, we just want to type the input as the built-in ``str``. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_typed.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_typed.py :lines: 6-12 :emphasize-lines: 2 :linenos: :lineno-start: 6 :caption: inputs_typed.py By using typed input instead we can catch this error prior to execution, and reduce the surface area we need to test and guard against in user code. .. 
thumbnail:: inputs_figure_seven.png diff --git a/docs/sections/learn/tutorial/inputs_figure_five.png b/docs/sections/intro_tutorial/inputs_figure_five.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_five.png rename to docs/sections/intro_tutorial/inputs_figure_five.png diff --git a/docs/sections/learn/tutorial/inputs_figure_four.png b/docs/sections/intro_tutorial/inputs_figure_four.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_four.png rename to docs/sections/intro_tutorial/inputs_figure_four.png diff --git a/docs/sections/learn/tutorial/inputs_figure_one.png b/docs/sections/intro_tutorial/inputs_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_one.png rename to docs/sections/intro_tutorial/inputs_figure_one.png diff --git a/docs/sections/learn/tutorial/inputs_figure_seven.png b/docs/sections/intro_tutorial/inputs_figure_seven.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_seven.png rename to docs/sections/intro_tutorial/inputs_figure_seven.png diff --git a/docs/sections/learn/tutorial/inputs_figure_six.png b/docs/sections/intro_tutorial/inputs_figure_six.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_six.png rename to docs/sections/intro_tutorial/inputs_figure_six.png diff --git a/docs/sections/learn/tutorial/inputs_figure_three.png b/docs/sections/intro_tutorial/inputs_figure_three.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_three.png rename to docs/sections/intro_tutorial/inputs_figure_three.png diff --git a/docs/sections/learn/tutorial/inputs_figure_two.png b/docs/sections/intro_tutorial/inputs_figure_two.png similarity index 100% rename from docs/sections/learn/tutorial/inputs_figure_two.png rename to docs/sections/intro_tutorial/inputs_figure_two.png diff --git a/docs/sections/learn/tutorial/intermediates.png b/docs/sections/intro_tutorial/intermediates.png similarity index 100% rename from docs/sections/learn/tutorial/intermediates.png rename to docs/sections/intro_tutorial/intermediates.png diff --git a/docs/sections/learn/tutorial/intermediates.rst b/docs/sections/intro_tutorial/intermediates.rst similarity index 90% rename from docs/sections/learn/tutorial/intermediates.rst rename to docs/sections/intro_tutorial/intermediates.rst index f8b118f64..864c1380f 100644 --- a/docs/sections/learn/tutorial/intermediates.rst +++ b/docs/sections/intro_tutorial/intermediates.rst @@ -1,97 +1,97 @@ .. _tutorial-intermediates: Intermediates ------------- We've already seen how solids can describe their persistent artifacts to the system using `materializations `_. Dagster also has a facility for automatically materializing the intermediate values that actually pass between solids. This can be very useful for debugging, when you want to inspect the value output by a solid and ensure that it is as you expect; for audit, when you want to understand how a particular downstream output was created; and for re-executing downstream solids with cached results from expensive upstream computations. To turn intermediate storage on, just set another key in the pipeline config: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/intermediates.yaml +.. 
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/intermediates.yaml :linenos: :emphasize-lines: 6-7 :caption: intermediates.yaml When you execute the pipeline using this config, you'll see new structured entries in the Dagit log viewer indicating that intermediates have been stored on the filesystem. .. thumbnail:: intermediates.png Intermediate storage for types that cannot be pickled ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ By default, Dagster will try to pickle intermediate values to store them on the filesystem. Some custom data types cannot be pickled (for instance, a Spark RDD), so you will need to tell Dagster how to serialize them. Our toy ``LessSimpleDataFrame`` is, of course, pickleable, but supposing it was not, let's set a custom :py:class:`SerializationStrategy ` on it to tell Dagster how to store intermediates of this type. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/serialization_strategy.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/serialization_strategy.py :lines: 12-36 :linenos: :lineno-start: 12 :caption: serialization_strategy.py Now, when we set the ``storage`` key in pipeline config and run this pipeline, we'll see that our intermediate is automatically persisted as a human-readable .csv: .. thumbnail:: serialization_strategy.png Reexecution ^^^^^^^^^^^ Once intermediates are being stored, Dagit makes it possible to individually execute solids whose outputs are satisfied by previously materialized intermediates. Click the small run button to the right of the ``sort_by_calories.compute`` execution step to reexecute only this step, using the automatically materialized intermediate output of the previous solid. .. thumbnail:: reexecution.png Reexecuting individual solids can be very helpful while you're writing solids, or while you're actively debugging them. You can also manually specify intermediates from previous runs as inputs to solids. Recall the syntax we used to set input values using the config system: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml :language: YAML :linenos: :caption: inputs_env.yaml Instead of setting the key ``value`` (i.e., providing a ), we can also set ``pickle``, as follows: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/reexecution_env.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/reexecution_env.yaml :language: YAML :linenos: :caption: reexecution_env.yaml (Of course, you'll need to use the path to an intermediate that is actually present on your filesystem.) If you directly substitute this config into Dagit, you'll see an error, because the system still expects the input to ``sort_by_calories`` to be satisfied by the output from ``read_csv``. .. thumbnail:: reexecution_errors.png To make this config valid, we'll need to tell Dagit to execute only a subset of the pipeline -- just the ``sort_by_calories`` solid. Click on the subset-selector button in the bottom left of the config editor screen (which, when no subset has been specified, will read "All Solids"): .. thumbnail:: subset_selection.png Hit "Apply", and this config will now pass validation, and the individual solid can be reexecuted: .. 
thumbnail:: subset_config.png This facility is especially valuable during test, since it allows you to validate newly written solids against values generated during previous runs of a known good pipeline. diff --git a/docs/sections/learn/tutorial/intro_airflow.rst b/docs/sections/intro_tutorial/intro_airflow.rst similarity index 96% rename from docs/sections/learn/tutorial/intro_airflow.rst rename to docs/sections/intro_tutorial/intro_airflow.rst index d92a16c5c..81ffe4ab0 100644 --- a/docs/sections/learn/tutorial/intro_airflow.rst +++ b/docs/sections/intro_tutorial/intro_airflow.rst @@ -1,109 +1,109 @@ Deploying to Airflow -------------------- Although Dagster includes stand-alone functionality for :ref:`executing `, :ref:`scheduling `, and :ref:`deploying pipelines on AWS `, we also support an incremental adoption path on top of existing `Apache Airflow `_ installs. We've seen how Dagster compiles a logical pipeline definition, appropriately parameterized by config, into a concrete execution plan for Dagster's execution engines. Dagster pipelines can also be compiled into the DAG format used by Airflow -- or, in principle, to any other intermediate representation used by a third-party scheduling or orchestration engine. Let's see how to get our simple cereal pipeline to run on Airflow every morning before breakfast, at 6:45 AM. Requirements ^^^^^^^^^^^^ You'll need an existing Airflow installation, and to install the ``dagster-airflow`` library into the Python environments in which your Airflow webserver and worker run: .. code-block:: shell $ pip install dagster-airflow You'll also need to make sure that the Dagster pipeline you want to run using Airflow is available in the Python environments in which your Airflow webserver and worker run. Scaffolding your pipeline for Airflow ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We'll use our familiar, simple demo pipeline: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/airflow.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/airflow.py :linenos: :caption: airflow.py To compile this existing pipeline to Airflow, we'll use the ``dagster-airflow`` CLI tool. By default, this tool will write the Airflow-compatible DAG scaffold out to ``$AIRFLOW_HOME/dags``. .. code-block:: shell $ dagster-airflow scaffold \ --module-name dagster_examples.intro_tutorial.airflow \ --pipeline-name hello_cereal_pipeline Wrote DAG scaffold to file: $AIRFLOW_HOME/dags/hello_cereal_pipeline.py Let's take a look at the generated file: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/hello_cereal_pipeline.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/hello_cereal_pipeline.py :linenos: :caption: hello_cereal_pipeline.py This is a fairly straightforward file with four parts. First, we import the basic prerequisites to define our DAG (and also make sure that the string "DAG" appears in the file, so that the Airflow webserver will detect it). Second, we define the config that Dagster will compile our pipeline against. Unlike Dagster pipelines, Airflow DAGs can't be parameterized dynamically at execution time, so this config is static after it's loaded by the Airflow webserver. Third, we set the ``DEFAULT_ARGS`` that will be passed down as the ``default_args`` argument to the ``airflow.DAG`` `constructor `_. Finally, we define the DAG and its constituent tasks using :py:func:`make_airflow_dag `. 
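The tail of the scaffold is roughly of the following shape -- this is a hedged sketch rather than the verbatim generated file, and the exact import path, argument names, and the ``ENVIRONMENT``/``DEFAULT_ARGS`` variable names are assumptions based on the description above:

.. code-block:: python

    from dagster_airflow import make_airflow_dag  # import path may differ by version

    # compile the Dagster pipeline into an Airflow DAG and its constituent tasks;
    # ENVIRONMENT and DEFAULT_ARGS stand in for the config and default args defined
    # earlier in the scaffold
    dag, tasks = make_airflow_dag(
        module_name='dagster_examples.intro_tutorial.airflow',
        pipeline_name='hello_cereal_pipeline',
        environment_dict=ENVIRONMENT,
        dag_kwargs={'default_args': DEFAULT_ARGS, 'max_active_runs': 1},
    )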
If you run this code interactively, you'll see that ``dag`` and ``tasks`` are ordinary Airflow objects, just as you'd expect to see when defining an Airflow pipeline manually: .. code-block:: python >>> dag >>> tasks [] Running a pipeline on Airflow ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ensure that the Airflow webserver, scheduler (and any workers appropriate to the executor you have configured) are running. The ``dagster-airflow`` CLI tool will automatically put the generated DAG definition in ``$AIRFLOW_HOME/dags``, but if you have a different setup, you should make sure that this file is wherever the Airflow webserver looks for DAG definitions. When you fire up the Airflow UI, you should see the new DAG: .. thumbnail:: intro_airflow_one.png Kick off a run manually, and you'll be able to use the ordinary views of the pipeline's progress: .. thumbnail:: intro_airflow_dag_view.png And logs will be available in the Airflow log viewer: .. thumbnail:: intro_airflow_logs_view.png More sophisticated setups ^^^^^^^^^^^^^^^^^^^^^^^^^ This approach requires that Dagster, all of the Python requirements of your Dagster pipelines, and your Dagster pipelines themselves be importable in the environment in which your Airflow tasks operate, because under the hood it uses Airflow's :py:class:`airflow:PythonOperator` to define tasks. ``dagster-airflow`` also includes facilities for compiling Dagster pipelines to Airflow DAGs whose tasks use the :py:class:`DockerOperator ` or :py:class:`KubernetesPodOperator `. For details, see the :ref:`deployment guide for Airflow `. diff --git a/docs/sections/learn/tutorial/intro_airflow_dag_view.png b/docs/sections/intro_tutorial/intro_airflow_dag_view.png similarity index 100% rename from docs/sections/learn/tutorial/intro_airflow_dag_view.png rename to docs/sections/intro_tutorial/intro_airflow_dag_view.png diff --git a/docs/sections/learn/tutorial/intro_airflow_logs_view.png b/docs/sections/intro_tutorial/intro_airflow_logs_view.png similarity index 100% rename from docs/sections/learn/tutorial/intro_airflow_logs_view.png rename to docs/sections/intro_tutorial/intro_airflow_logs_view.png diff --git a/docs/sections/learn/tutorial/intro_airflow_one.png b/docs/sections/intro_tutorial/intro_airflow_one.png similarity index 100% rename from docs/sections/learn/tutorial/intro_airflow_one.png rename to docs/sections/intro_tutorial/intro_airflow_one.png diff --git a/docs/sections/learn/tutorial/materializations.png b/docs/sections/intro_tutorial/materializations.png similarity index 100% rename from docs/sections/learn/tutorial/materializations.png rename to docs/sections/intro_tutorial/materializations.png diff --git a/docs/sections/learn/tutorial/materializations.rst b/docs/sections/intro_tutorial/materializations.rst similarity index 87% rename from docs/sections/learn/tutorial/materializations.rst rename to docs/sections/intro_tutorial/materializations.rst index 0eac49c6a..5b7eacc5f 100644 --- a/docs/sections/learn/tutorial/materializations.rst +++ b/docs/sections/intro_tutorial/materializations.rst @@ -1,86 +1,86 @@ Materializations ---------------- Steps in a data pipeline often produce persistent artifacts, for instance, graphs or tables describing the result of some computation. Typically these artifacts are saved to disk (or to cloud storage) with a `name `__ that has something to do with their origin.
But it can be hard to organize and cross-reference artifacts produced by many different runs of a pipeline, or to identify all of the files that might have been created by some pipeline's logic. Dagster solids can describe their persistent artifacts to the system by yielding :py:class:`Materialization ` events. Like :py:class:`TypeCheck ` and :py:class:`ExpectationResult `, materializations are side-channels for metadata -- they don't get passed to downstream solids and they aren't used to define the data dependencies that structure a pipeline's DAG. Suppose that we rewrite our ``sort_calories`` solid so that it saves the newly sorted data frame to disk. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/materializations.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/materializations.py :lines: 23-43 :linenos: :lineno-start: 23 :caption: materializations.py We've taken the basic precaution of ensuring that the saved csv file has a different filename for each run of the pipeline. But there's no way for Dagit to know about this persistent artifact. So we'll add the following lines: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/materializations.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/materializations.py :lines: 23-53 :linenos: :lineno-start: 23 :emphasize-lines: 22-31 :caption: materializations.py Note that we've had to add the last line, yielding an :py:class:`Output `. Until now, all of our solids have relied on Dagster's implicit conversion of the return value of a solid's compute function into its output. When we explicitly yield other types of events from solid logic, we need to also explicitly yield the output so that the framework can recognize them. Now, if we run this pipeline in Dagit: .. thumbnail:: materializations.png Configurably materializing custom data types ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Data types can also be configured so that outputs materialize themselves, obviating the need to explicitly yield a :py:class:`Materialization ` from solid logic. Dagster calls this facility the :py:func:`@output_materialization_config `. Suppose we would like to be able to configure outputs of our toy custom type, the ``SimpleDataFrame``, to be automatically materialized to disk both as a pickle and as a .csv. (This is a reasonable idea, since .csv files are human-readable and manipulable by a wide variety of third party tools, while pickle is a binary format.) -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/output_materialization.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/output_materialization.py :lines: 29-64 :linenos: :lineno-start: 29 :caption: output_materialization.py We set the output materialization config on the type: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/output_materialization.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/output_materialization.py :lines: 67-74 :linenos: :lineno-start: 67 :emphasize-lines: 5 :caption: output_materialization.py Now we can tell Dagster to materialize intermediate outputs of this type by providing config: -..
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/output_materialization.yaml :linenos: :lines: 6-10 :lineno-start: 6 :caption: output_materialization.yaml When we run this pipeline, we'll see that materializations are yielded (and visible in the structured logs in Dagit), and that files are created on disk (with the semicolon separator we specified). .. thumbnail:: output_materializations.png diff --git a/docs/sections/learn/tutorial/modes.png b/docs/sections/intro_tutorial/modes.png similarity index 100% rename from docs/sections/learn/tutorial/modes.png rename to docs/sections/intro_tutorial/modes.png diff --git a/docs/sections/learn/tutorial/multiple_outputs.png b/docs/sections/intro_tutorial/multiple_outputs.png similarity index 100% rename from docs/sections/learn/tutorial/multiple_outputs.png rename to docs/sections/intro_tutorial/multiple_outputs.png diff --git a/docs/sections/learn/tutorial/multiple_outputs.rst b/docs/sections/intro_tutorial/multiple_outputs.rst similarity index 89% rename from docs/sections/learn/tutorial/multiple_outputs.rst rename to docs/sections/intro_tutorial/multiple_outputs.rst index 4454f819f..16d34b8a3 100644 --- a/docs/sections/learn/tutorial/multiple_outputs.rst +++ b/docs/sections/intro_tutorial/multiple_outputs.rst @@ -1,88 +1,88 @@ Multiple and conditional outputs -------------------------------- Solids can have arbitrarily many outputs, and downstream solids can depend on any number of these. What's more, outputs don't necessarily have to be yielded by solids, which lets us write pipelines where some solids conditionally execute based on the presence of an upstream output. Suppose we're interested in splitting hot and cold cereals into separate datasets and processing them separately, based on config. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/multiple_outputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/multiple_outputs.py :linenos: :lineno-start: 31 :caption: multiple_outputs.py :lines: 31-55 :emphasize-lines: 6-13, 20, 25 Solids that yield multiple outputs must declare, and name, their outputs (passing ``output_defs`` to the :py:func:`@solid ` decorator). Output names must be unique and each :py:func:`Output ` yielded by a solid's compute function must have a name that corresponds to one of these declared outputs. We'll define two downstream solids and hook them up to the multiple outputs from ``split_cereals``. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/multiple_outputs.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/multiple_outputs.py :linenos: :lineno-start: 58 :caption: multiple_outputs.py :lines: 58-82 :emphasize-lines: 23-25 As usual, we can visualize this in Dagit: .. thumbnail:: multiple_outputs.png Notice that the logical DAG corresponding to the pipeline definition includes both dependencies -- we won't know about the conditionality in the pipeline until runtime, when one of the outputs is not yielded by ``split_cereals``. .. thumbnail:: multiple_outputs_zoom.png Zooming in, Dagit shows us the details of the multiple outputs from ``split_cereals`` and their downstream dependencies. When we execute this pipeline with the following config, we'll see that the cold cereals output is omitted and that the execution step corresponding to the downstream solid is marked skipped in the right hand pane. ..
Reusable solids --------------- Solids are intended to abstract chunks of business logic, but abstractions aren't very meaningful unless they can be reused. Our conditional outputs pipeline included a lot of repeated code -- ``sort_hot_cereals_by_calories`` and ``sort_cold_cereals_by_calories``, for instance. In general, it's preferable to build pipelines out of a relatively restricted set of well-tested library solids, using config liberally to parametrize them. (You'll certainly have your own version of ``read_csv``, for instance, and Dagster includes libraries like ``dagster_aws`` and ``dagster_spark`` to wrap and abstract interfaces with common third party tools.) Let's replace ``sort_hot_cereals_by_calories`` and ``sort_cold_cereals_by_calories`` with two aliases of the same library solid: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/reusable_solids.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/reusable_solids.py :linenos: :lineno-start: 59 :lines: 59-76 :emphasize-lines: 15-16 :caption: reusable_solids.py You'll see that Dagit distinguishes between the two invocations of the single library solid and the solid's definition. The invocation is named and bound via a dependency graph to other invocations of other solids. The definition is the generic, reusable piece of logic that is invoked many times within this pipeline. .. thumbnail:: reusable_solids.png Configuring solids also uses the aliases, as in the following YAML: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/reusable_solids.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/reusable_solids.yaml :linenos: :emphasize-lines: 6, 8 :caption: reusable_solids.yaml
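The aliasing itself happens in the pipeline definition, along roughly these lines -- a hedged sketch rather than the exact contents of ``reusable_solids.py``; it assumes the ``read_csv`` and ``split_cereals`` solids defined earlier in this tutorial:

.. code-block:: python

    from dagster import pipeline, solid


    @solid
    def sort_by_calories(context, cereals):
        # A generic library solid: sorts whatever list of cereal rows it is given.
        return sorted(cereals, key=lambda cereal: int(cereal['calories']))


    @pipeline
    def reusable_solids_pipeline():
        # read_csv and split_cereals are the solids defined earlier in this tutorial.
        hot_cereals, cold_cereals = split_cereals(read_csv())
        # Two invocations (aliases) of the same underlying solid definition;
        # Dagit and config address them by their alias names.
        sort_by_calories.alias('sort_hot_cereals_by_calories')(hot_cereals)
        sort_by_calories.alias('sort_cold_cereals_by_calories')(cold_cereals)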
diff --git a/docs/sections/learn/tutorial/multiple_outputs_zoom.png b/docs/sections/intro_tutorial/multiple_outputs_zoom.png similarity index 100% rename from docs/sections/learn/tutorial/multiple_outputs_zoom.png rename to docs/sections/intro_tutorial/multiple_outputs_zoom.png diff --git a/docs/sections/learn/tutorial/multiple_results_figure_one.png b/docs/sections/intro_tutorial/multiple_results_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/multiple_results_figure_one.png rename to docs/sections/intro_tutorial/multiple_results_figure_one.png diff --git a/docs/sections/learn/tutorial/output_materializations.png b/docs/sections/intro_tutorial/output_materializations.png similarity index 100% rename from docs/sections/learn/tutorial/output_materializations.png rename to docs/sections/intro_tutorial/output_materializations.png diff --git a/docs/sections/learn/tutorial/output_notebook.png b/docs/sections/intro_tutorial/output_notebook.png similarity index 100% rename from docs/sections/learn/tutorial/output_notebook.png rename to docs/sections/intro_tutorial/output_notebook.png diff --git a/docs/sections/learn/tutorial/post_boilerplate_notebook.png b/docs/sections/intro_tutorial/post_boilerplate_notebook.png similarity index 100% rename from docs/sections/learn/tutorial/post_boilerplate_notebook.png rename to docs/sections/intro_tutorial/post_boilerplate_notebook.png diff --git a/docs/sections/learn/tutorial/pre_boilerplate_notebook.png b/docs/sections/intro_tutorial/pre_boilerplate_notebook.png similarity index 100% rename from docs/sections/learn/tutorial/pre_boilerplate_notebook.png rename to docs/sections/intro_tutorial/pre_boilerplate_notebook.png diff --git a/docs/sections/learn/tutorial/presets.png b/docs/sections/intro_tutorial/presets.png similarity index 100% rename from docs/sections/learn/tutorial/presets.png rename to docs/sections/intro_tutorial/presets.png diff --git a/docs/sections/learn/tutorial/reexecution.png b/docs/sections/intro_tutorial/reexecution.png similarity index 100% rename from docs/sections/learn/tutorial/reexecution.png rename to docs/sections/intro_tutorial/reexecution.png diff --git a/docs/sections/learn/tutorial/reexecution_errors.png b/docs/sections/intro_tutorial/reexecution_errors.png similarity index 100% rename from docs/sections/learn/tutorial/reexecution_errors.png rename to docs/sections/intro_tutorial/reexecution_errors.png diff --git a/docs/sections/learn/tutorial/repos.png b/docs/sections/intro_tutorial/repos.png similarity index 100% rename from docs/sections/learn/tutorial/repos.png rename to docs/sections/intro_tutorial/repos.png diff --git a/docs/sections/learn/tutorial/repos.rst b/docs/sections/intro_tutorial/repos.rst similarity index 92% rename from docs/sections/learn/tutorial/repos.rst rename to docs/sections/intro_tutorial/repos.rst index b573f5a77..bd06347ab 100644 --- a/docs/sections/learn/tutorial/repos.rst +++ b/docs/sections/intro_tutorial/repos.rst @@ -1,54 +1,54 @@ Organizing pipelines in repositories ------------------------------------ In all of the examples we've seen so far, we've specified a file (``-f``) or a module (``-m``) and named a pipeline definition function (``-n``) in order to tell the CLI tools how to load a pipeline, e.g.: .. code-block:: console $ dagit -f hello_cereal.py -n hello_cereal_pipeline $ dagster pipeline execute -f hello_cereal.py \ -n hello_cereal_pipeline But most of the time, especially when working on long-running projects with other people, we will want to be able to target many pipelines at once with our tools. Dagster provides the concept of a repository, a collection of pipelines that the Dagster tools can target as a whole. Repositories are declared using the :py:func:`RepositoryDefinition ` API as follows: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/repos.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/repos.py :linenos: :lines: 10-24 :lineno-start: 10 :caption: repos.py Note that the name of the pipeline in the ``RepositoryDefinition`` must match the name we declared for it in its ``pipeline`` (the default is the function name). Don't worry: if these names don't match, you'll see a helpful error message. If you save this file as ``repos.py``, you can then run the command line tools on it. Try running: .. code-block:: console $ dagit -f repos.py -n define_repo Now you can see the list of all pipelines in the repo via the dropdown at the top: .. thumbnail:: repos.png Typing the name of the file and function defining the repository gets tiresome and repetitive, so let's create a declarative config file with this information to make using the command line tools easier. Save this file as ``repository.yaml``. This is the default name for a repository config file, although you can tell the CLI tools to use any file you like with the ``-y`` flag. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/repository.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/repository.yaml :language: YAML :caption: repository.yaml Now you should be able to list the pipelines in this repo without all the typing: .. code-block:: console $ dagit
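For orientation, a minimal repository definition looks roughly like this -- a hedged sketch, since the exact constructor arguments (for example ``pipeline_defs``) have shifted between Dagster versions:

.. code-block:: python

    from dagster import RepositoryDefinition, pipeline, solid


    @solid
    def hello_cereal(context):
        context.log.info('Hello, cereal!')


    @pipeline
    def hello_cereal_pipeline():
        hello_cereal()


    def define_repo():
        # Collect pipelines into a single repository that the CLI tools and
        # Dagit can target as a whole.
        return RepositoryDefinition(
            name='hello_cereal_repository', pipeline_defs=[hello_cereal_pipeline]
        )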
diff --git a/docs/sections/learn/tutorial/resources.rst b/docs/sections/intro_tutorial/resources.rst similarity index 90% rename from docs/sections/learn/tutorial/resources.rst rename to docs/sections/intro_tutorial/resources.rst index 4d7368d5e..25886ab03 100644 --- a/docs/sections/learn/tutorial/resources.rst +++ b/docs/sections/intro_tutorial/resources.rst @@ -1,196 +1,196 @@ Parametrizing pipelines with resources -------------------------------------- Often, we'll want to be able to configure pipeline-wide facilities, like uniform access to the file system, databases, or cloud services. Dagster models interactions with features of the external environment like these as resources (and library modules such as ``dagster_aws``, ``dagster_gcp``, and even ``dagster_slack`` provide out-of-the-box implementations for many common external services). Typically, your data processing pipelines will want to store their results in a data warehouse somewhere separate from the raw data sources. We'll adjust our toy pipeline so that it does a little more work on our cereal dataset, stores the finished product in a swappable data warehouse, and lets the team know when we're finished. You might have noticed that our cereal dataset isn't normalized -- that is, the serving sizes for some cereals are as small as a quarter of a cup, and for others are as large as a cup and a half. This grossly understates the nutritional difference between our different cereals. Let's transform our dataset and then store it in a normalized table in the warehouse: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/resources.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/resources.py :caption: resources.py :linenos: :lineno-start: 57 :lines: 57-80 :emphasize-lines: 24 Resources are another facility that Dagster makes available on the ``context`` object passed to solid logic. Note that we've completely encapsulated access to the database behind the call to ``context.resources.warehouse.update_normalized_cereals``. This means that we can easily swap resource implementations -- for instance, to test against a local SQLite database instead of a production Snowflake database; to abstract software changes, such as swapping raw SQL for SQLAlchemy; or to accommodate changes in business logic, like moving from an overwriting scheme to append-only, date-partitioned tables. To implement a resource and specify its config schema, we use the :py:class:`@resource ` decorator. The decorated function should return whatever object you wish to make available under the specific resource's slot in ``context.resources``. Resource constructor functions have access to their own ``context`` argument, which gives access to resource-specific config. (Unlike the contexts we've seen so far, which are instances of :py:class:`SystemComputeExecutionContext `, this context is an instance of :py:class:`InitResourceContext `.) -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/resources.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/resources.py :caption: resources.py :linenos: :lineno-start: 16 :lines: 16-45 :emphasize-lines: 28-30 The last thing we need to do is to attach the resource to our pipeline, so that it's properly initialized when the pipeline run begins and made available to our solid logic as ``context.resources.warehouse``. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/resources.py +..
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/resources.py :caption: resources.py :linenos: :lineno-start: 83 :lines: 83-91 :emphasize-lines: 2-6 All resources are associated with a :py:class:`ModeDefinition `. So far, all of our pipelines have had only a single, system default mode, so we haven't had to tell Dagster what mode to run them in. Even in this case, where we provide a single anonymous mode to the :py:func:`@pipeline ` decorator, we won't have to specify which mode to use (it will take the place of the ``default`` mode). We can put it all together with the following config: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/resources.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/resources.yaml :caption: resources.yaml :language: YAML :linenos: :emphasize-lines: 1-4 (Here, we pass the special string ``":memory:"`` in config as the connection string for our database -- this is how SQLite designates an in-memory database.) Expressing resource dependencies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We've provided a ``warehouse`` resource to our pipeline, but we're still manually managing our pipeline's dependency on this resource. Dagster also provides a way for solids to advertise their resource requirements, to make it easier to keep track of which resources need to be provided for a pipeline. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/required_resources.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/required_resources.py :caption: required_resources.py :linenos: :lineno-start: 55 :lines: 55-78 :emphasize-lines: 1 Now, the Dagster machinery knows that this solid requires a resource called ``warehouse`` to be present on its mode definitions, and will complain if that resource is not present. Pipeline modes -------------- By attaching different sets of resources with the same APIs to different modes, we can support running pipelines -- with unchanged business logic -- in different environments. So you might have a "unittest" mode that runs against an in-memory SQLite database, a "dev" mode that runs against Postgres, and a "prod" mode that runs against Snowflake. Separating the resource definition from the business logic makes pipelines testable. As long as the APIs of the resources agree, and the fundamental operations they expose are tested in each environment, we can test business logic independent of environments that may be very costly or difficult to test against. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/modes.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/modes.py :lines: 74-83 :linenos: :lineno-start: 74 :caption: modes.py Even if you're not familiar with SQLAlchemy, it's enough to note that this is a very different implementation of the ``warehouse`` resource. To make this implementation available to Dagster, we attach it to a :py:class:`ModeDefinition `. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/modes.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/modes.py :lines: 126-141 :linenos: :lineno-start: 126 :caption: modes.py Each of the ways we can invoke a Dagster pipeline lets us select which mode we'd like to run it in. From the command line, we can set ``-d`` or ``--mode`` and select the name of the mode: .. 
code-block:: shell $ dagster pipeline execute -f modes.py -n modes_pipeline -d dev From the Python API, we can use the :py:class:`RunConfig `: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/modes.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/modes.py :lines: 151-155 :linenos: :lineno-start: 151 :emphasize-lines: 4 :dedent: 4 :caption: modes.py And in Dagit, we can use the "Mode" selector to pick the mode in which we'd like to execute. .. thumbnail:: modes.png The config editor in Dagit is mode-aware, so when you switch modes and introduce a resource that requires additional config, the editor will prompt you. Pipeline config presets ----------------------- Useful as the Dagit config editor and the ability to stitch together YAML fragments are, once pipelines have been productionized and config is unlikely to change, it's often preferable to distribute pipelines with embedded config. For example, you might point solids at different S3 buckets in different environments, or want to pull database credentials from different environment variables. Dagster calls this a config preset: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/presets.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/presets.py :lines: 127-166 :linenos: :lineno-start: 127 :caption: presets.py :emphasize-lines: 14-37 We illustrate two ways of defining a preset. The first is to pass an ``environment_dict`` literal to the constructor. Because this dict is defined in Python, you can do arbitrary computation to construct it -- for instance, picking up environment variables, making a call to a secrets store like Hashicorp Vault, etc. The second is to use the ``from_files`` static constructor, and pass a list of file globs from which to read YAML fragments. Order matters in this case, and keys from later files will overwrite keys from earlier files. To select a preset for execution, we can use the CLI, the Python API, or Dagit. From the CLI, use ``-p`` or ``--preset``: .. code-block:: shell $ dagster pipeline execute -f presets.py -n presets_pipeline -p unittest From Python, you can use :py:func:`execute_pipeline_with_preset `: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/presets.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/presets.py :lines: 171 :dedent: 4 And in Dagit, we can use the "Presets" selector. .. thumbnail:: presets.png diff --git a/docs/sections/learn/tutorial/reusable_solids.png b/docs/sections/intro_tutorial/reusable_solids.png similarity index 100% rename from docs/sections/learn/tutorial/reusable_solids.png rename to docs/sections/intro_tutorial/reusable_solids.png diff --git a/docs/sections/learn/tutorial/scheduler.rst b/docs/sections/intro_tutorial/scheduler.rst similarity index 93% rename from docs/sections/learn/tutorial/scheduler.rst rename to docs/sections/intro_tutorial/scheduler.rst index ceffbf271..f8ec43285 100644 --- a/docs/sections/learn/tutorial/scheduler.rst +++ b/docs/sections/intro_tutorial/scheduler.rst @@ -1,144 +1,144 @@ .. _scheduler: Scheduling pipeline runs ------------------------ Dagster includes a simple built-in scheduler that works with Dagit for control and monitoring. Suppose that we need to run our simple cereal pipeline every morning before breakfast, at 6:45 AM. Requirements ^^^^^^^^^^^^ You'll need to install the ``dagster-cron`` library. ..
code-block:: shell $ pip install dagster-cron You must also ensure that ``cron`` is installed on the machine you're running the scheduler on. Pipeline ^^^^^^^^ -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/scheduler.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/scheduler.py :linenos: :lines: 1-34 :caption: scheduler.py As before, we've defined some solids, a pipeline, and a repository. Defining the scheduler ^^^^^^^^^^^^^^^^^^^^^^ We first need to define the Scheduler on our :py:class:`~dagster.core.instance.DagsterInstance`. For now, the only implemented scheduler is ``dagster_cron.SystemCronScheduler``, but this is pluggable (and you can write your own). To use the scheduler, add the following lines to your ``$DAGSTER_HOME/dagster.yaml``: .. code-block:: yaml :caption: dagster.yaml scheduler: module: dagster_cron.cron_scheduler class: SystemCronScheduler config: artifacts_dir: /path/to/dagster_home/schedule Defining schedules ^^^^^^^^^^^^^^^^^^ Now we'll write a :py:class:`ScheduleDefinition ` to define the schedule we want. We pass the ``cron_schedule`` parameter to this class to define when the pipeline should run using the standard cron syntax; the other parameters determine other familiar aspects of how the pipeline will run, such as its config. We wrap the schedule definition in a function decorated with :py:class:`@schedules `: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/scheduler.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/scheduler.py :linenos: :lines: 36-45 :lineno-start: 38 :caption: scheduler.py To complete the picture, we'll need to extend the ``repository.yaml`` structure we've met before with a new key, ``scheduler``. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/scheduler.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/scheduler.yaml :linenos: :caption: scheduler.yaml Starting schedules ^^^^^^^^^^^^^^^^^^ Whenever we make changes to schedule definitions using the ``SystemCronScheduler``, we need to run ``dagster schedule up``. This utility will create, update, or remove schedules in the underlying system cron file as appropriate to assure it is consistent with the schedule definitions in code. To preview the changes, first run: .. code-block:: console $ dagster schedule up --preview -y scheduler.yaml Planned Changes: + good_morning (add) After confirming schedule changes are as expected, run: .. code-block:: console $ dagster schedule up -y scheduler.yaml Changes: + good_morning (add) Verify that the ``good_morning`` scheduled job has been added to ``cron``: .. code-block:: console $ crontab -l If the ``good_morning`` job is not listed, you may have to start it with: .. code-block:: console $ dagster schedule start good_morning Now, we can load dagit to view the schedule and monitor runs: .. code-block:: console $ dagit -y scheduler.yaml Cron filters ^^^^^^^^^^^^^^^^^^ If you need to define a more specific schedule than cron allows, you can pass a function in the ``should_execute`` argument to :py:class:`ScheduleDefinition `. For example, we can define a filter that only returns `True` on weekdays: .. 
code-block:: python import datetime def weekday_filter(): weekno = datetime.datetime.today().weekday() # Returns true if current day is a weekday return weekno < 5 If we combine this `should_execute` filter with a `cron_schedule` that runs at 6:45am every day, then we’ll have a schedule that runs at 6:45am only on weekdays. .. code-block:: python good_weekday_morning = ScheduleDefinition( name="good_weekday_morning", cron_schedule="45 6 * * *", pipeline_name="hello_cereal_pipeline", environment_dict={"storage": {"filesystem": {}}}, should_execute=weekday_filter, ) diff --git a/docs/sections/learn/tutorial/serial_pipeline_figure_one.png b/docs/sections/intro_tutorial/serial_pipeline_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/serial_pipeline_figure_one.png rename to docs/sections/intro_tutorial/serial_pipeline_figure_one.png diff --git a/docs/sections/learn/tutorial/serialization_strategy.png b/docs/sections/intro_tutorial/serialization_strategy.png similarity index 100% rename from docs/sections/learn/tutorial/serialization_strategy.png rename to docs/sections/intro_tutorial/serialization_strategy.png diff --git a/docs/sections/learn/tutorial/subset_config.png b/docs/sections/intro_tutorial/subset_config.png similarity index 100% rename from docs/sections/learn/tutorial/subset_config.png rename to docs/sections/intro_tutorial/subset_config.png diff --git a/docs/sections/learn/tutorial/subset_selection.png b/docs/sections/intro_tutorial/subset_selection.png similarity index 100% rename from docs/sections/learn/tutorial/subset_selection.png rename to docs/sections/intro_tutorial/subset_selection.png diff --git a/docs/sections/learn/tutorial/test_add_pipeline.png b/docs/sections/intro_tutorial/test_add_pipeline.png similarity index 100% rename from docs/sections/learn/tutorial/test_add_pipeline.png rename to docs/sections/intro_tutorial/test_add_pipeline.png diff --git a/docs/sections/learn/tutorial/types.rst b/docs/sections/intro_tutorial/types.rst similarity index 90% rename from docs/sections/learn/tutorial/types.rst rename to docs/sections/intro_tutorial/types.rst index a197438cd..2143f9fa5 100644 --- a/docs/sections/learn/tutorial/types.rst +++ b/docs/sections/intro_tutorial/types.rst @@ -1,279 +1,279 @@ User-defined types ------------------ Note that this section requires Python 3. We've seen how we can type the inputs and outputs of solids using Python 3's typing system, and how to use Dagster's built-in config types, such as :py:class:`dagster.String`, to define config schemas for our solids. But what about when you want to define your own types? Let's look back at our simple ``read_csv`` solid. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_typed.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_typed.py :lines: 6-12 :caption: inputs_typed.py :linenos: :lineno-start: 6 The ``lines`` object returned by Python's built-in ``csv.DictReader`` is a list of ``collections.OrderedDict``, each of which represents one row of the dataset: .. 
code-block:: python [ OrderedDict([ ('name', '100% Bran'), ('mfr', 'N'), ('type', 'C'), ('calories', '70'), ('protein', '4'), ('fat', '1'), ('sodium', '130'), ('carbo', '5'), ('sugars', '6'), ('potass', '280'), ('vitamins', '25'), ('shelf', '3'), ('weight', '1'), ('cups', '0.33'), ('rating', '68.402973') ]), OrderedDict([ ('name', '100% Natural Bran'), ('mfr', 'Q'), ('type', 'C'), ('calories', '120'), ('protein', '3'), ('fat', '5'), ('sodium', '15'), ('fiber', '2'), ('carbo', '8'), ('sugars', '8'), ('potass', '135'), ('vitamins', '0'), ('shelf', '3'), ('weight', '1'), ('cups', '1'), ('rating', '33.983679') ]), ... ] This is a simple representation of a "data frame", or a table of data. We'd like to be able to use Dagster's type system to type the output of ``read_csv``, so that we can do type checking when we construct the pipeline, ensuring that any solid consuming the output of ``read_csv`` expects to receive a data frame. To do this, we'll use the :py:func:`@usable_as_dagster_type ` decorator: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types.py :lines: 3-14 :linenos: :lineno-start: 3 :caption: custom_types.py Now we can annotate the rest of our pipeline with our new type: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types.py :lines: 17-43 :emphasize-lines: 2, 7, 11 :linenos: :lineno-start: 17 :caption: custom_types.py The type metadata now appears in dagit and the system will ensure the input and output to this solid are indeed instances of SimpleDataFrame. As usual, run: .. code-block:: console $ dagit -f custom_types.py -n custom_type_pipeline .. thumbnail:: custom_types_figure_one.png You can see that the output of ``read_csv`` (which by default has the name ``result``) is marked to be of type ``SimpleDataFrame``. Custom type checks ^^^^^^^^^^^^^^^^^^ The Dagster framework will fail type checks when a value isn't an instance of the type we're expecting, e.g., if ``read_csv`` were to return a ``str`` rather than a ``SimpleDataFrame``. Sometimes we know more about the types of our values, and we'd like to do deeper type checks. For example, in the case of the ``SimpleDataFrame``, we expect to see a list of OrderedDicts, and for each of these OrderedDicts to have the same fields, in the same order. The :py:func:`@usable_as_dagster_type ` decorator lets us specify custom type checks like this. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_2.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_2.py :lines: 6-29 :linenos: :lineno-start: 6 :emphasize-lines: 1, 21 :caption: custom_types_2.py Now, if our solid logic fails to return the right type, we'll see a type check failure. Let's replace our ``read_csv`` solid with the following bad logic: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_2.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_2.py :lines: 32-38 :linenos: :lineno-start: 32 :caption: custom_types_2.py When we run the pipeline with this solid, we'll see an error like: .. code-block:: console 2019-10-12 13:19:19 - dagster - ERROR - custom_type_pipeline - 266c6a93-75e2-46dc-8bd7-d684ce91d0d1 - STEP_FAILURE - Execution of step "bad_read_csv.compute" failed. 
cls_name = "Failure" solid = "bad_read_csv" solid_definition = "bad_read_csv" step_key = "bad_read_csv.compute" user_failure_data = {"description": "LessSimpleDataFrame should be a list of OrderedDicts, got for row 1", "label": "intentional-failure", "metadata_entries": []} .. input_hydration_config: Providing input values for custom types in config ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We saw earlier how, when a solid doesn't receive all of its inputs from other solids further upstream in the pipeline, we can specify its input values in config: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/inputs_env.yaml :language: YAML :linenos: :caption: inputs_env.yaml The Dagster framework knows how to intepret values provided via config as scalar inputs. In this case, ``read_csv`` just takes the string representation of the filepath from which it'll read a csv. But for more complex, custom types, we need to tell Dagster how to interpret config values. Consider our LessSimpleDataFrame. It might be convenient if Dagster knew automatically how to read a data frame in from a csv file, without us needing to separate that logic into the ``read_csv`` solid -- especially if we knew the provenance and format of that csv file (e.g., if we were using standard csvs as an internal interchange format) and didn't need the full configuration surface of a general purpose ``read_csv`` solid. What we want to be able to do is write: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_type_input.yaml +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_type_input.yaml :language: YAML :linenos: :caption: custom_type_input.yaml In order for the Dagster machinery to be able to decode the config value ``{'csv': 'cereal.csv'}`` into an input of the correct ``LessSimpleDataFrame`` value, we need to write what we call an input hydration config. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_3.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_3.py :lines: 31-37 :linenos: :lineno-start: 31 :caption: custom_types_3.py :emphasize-lines: 1 A function decorated with :py:func:`@input_hydration_config ` should take the context object, as usual, and a parameter representing the parsed config field. The schema for this field is defined by the argument to the :py:func:`@input_hydration_config ` decorator. Here, we introduce the :py:class:`Selector ` type, which lets you specify mutually exclusive options in config schemas. Here, there's only one option, ``csv``, but you can imagine a more sophisticated data frame type that might also know how to hydrate its inputs from other formats and sources, and might have a selector with fields like ``parquet``, ``xlsx``, ``sql``, etc. Then insert this into the original declaration: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_3.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_3.py :lines: 40-47 :emphasize-lines: 5 :linenos: :lineno-start: 40 :caption: custom_types_3.py Now if you run a pipeline with this solid from dagit you will be able to provide sources for these inputs via config: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_3.py +.. 
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_3.py :lines: 73-79 :linenos: :lineno-start: 73 :caption: custom_types_3.py Testing custom types ^^^^^^^^^^^^^^^^^^^^ As you write your own custom types, you'll also want to set up unit tests that ensure your types are doing what you expect them to. Dagster includes a utility function, :py:func:`check_dagster_type `, that lets you type check any Dagster type against any value. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_test.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_test.py :lines: 97-108 :linenos: :lineno-start: 97 :caption: custom_types_test.py Well tested library types can be reused across solids and pipelines to provide standardized type checking within your organization's data applications. Metadata and data quality checks -------------------------------- Custom types can also yield metadata about the type check. For example, in the case of our data frame, we might want to record the number of rows and columns in the dataset when our type checks succeed, and provide more information about why type checks failed when they fail. User-defined type check functions can optionally return a :py:class:`TypeCheck ` object that contains metadata about the success or failure of the type check. Let's see how to use this to emit some summary statistics about our DataFrame type: -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_4.py +.. literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_4.py :lines: 17-69 :linenos: :lineno-start: 17 :emphasize-lines: 3-9, 33-53 :caption: custom_types_4.py A :py:class:`TypeCheck ` must include a ``success`` argument describing whether the check passed or failed, and may include a description and/or a list of :py:class:`EventMetadataEntry ` objects. You should use the static constructors on :py:class:`EventMetadataEntry ` to construct these objects, which are flexible enough to support arbitrary metadata in JSON or Markdown format. Dagit knows how to display and archive structured metadata of this kind for future review: .. thumbnail:: custom_types_figure_two.png Expectations ^^^^^^^^^^^^ Custom type checks and metadata are appropriate for checking that a value will behave as we expect, and for collecting summary information about values. But sometimes we want to make more specific, data- and business logic-dependent assertions about the semantics of values. It typically isn't appropriate to embed assertions like these into data types directly. For one, they will usually vary substantially between instantiations -- for example, we don't expect all data frames to have the same number of columns, and over-specifying data types (e.g., ``SixColumnedDataFrame``) makes it difficult for generic logic to work generically (e.g., over all data frames). What's more, these additional, deeper semantic assertions are often non-stationary. Typically, you'll start running a pipeline with certain expectations about the data that you'll see; but over time, you'll learn more about your data (making your expectations more precise), and the process in the world that generates your data will shift (making some of your expectations invalid.) To support this kind of assertion, Dagster includes support for expressing your expectations about data in solid logic. -.. literalinclude:: ../../../../examples/dagster_examples/intro_tutorial/custom_types_5.py +.. 
literalinclude:: ../../../examples/dagster_examples/intro_tutorial/custom_types_5.py :lines: 93-135 :linenos: :lineno-start: 93 :emphasize-lines: 1-3, 31 :caption: custom_types_5.py Until now, every solid we've encountered has returned its result value, or ``None``. But solids can also yield events of various types for side-channel communication about the results of their computations. We've already encountered the :py:class:`TypeCheck ` event, which is typically yielded by the type machinery (but can also be yielded manually from the body of a solid's compute function); :py:class:`ExpectationResult ` is another kind of structured side-channel result that a solid can yield. These extra events don't get passed to downstream solids and they aren't used to define the data dependencies of a pipeline DAG. If you're already using the `Great Expectations `_ library to express expectations about your data, you may be interested in the ``dagster_ge`` wrapper library. This part of this system remains relatively immature, but yielding structured expectation results from your solid logic means that in future, tools like Dagit will be able to aggregate and track expectation results, as well as implement sophisticated policy engines to drive alerting and exception handling on a deep semantic basis. diff --git a/docs/sections/learn/tutorial/types_figure_one.png b/docs/sections/intro_tutorial/types_figure_one.png similarity index 100% rename from docs/sections/learn/tutorial/types_figure_one.png rename to docs/sections/intro_tutorial/types_figure_one.png diff --git a/docs/sections/learn/tutorial/types_figure_three.png b/docs/sections/intro_tutorial/types_figure_three.png similarity index 100% rename from docs/sections/learn/tutorial/types_figure_three.png rename to docs/sections/intro_tutorial/types_figure_three.png diff --git a/docs/sections/learn/tutorial/types_figure_two.png b/docs/sections/intro_tutorial/types_figure_two.png similarity index 100% rename from docs/sections/learn/tutorial/types_figure_two.png rename to docs/sections/intro_tutorial/types_figure_two.png diff --git a/docs/sections/learn/learn.rst b/docs/sections/learn/index.rst similarity index 86% rename from docs/sections/learn/learn.rst rename to docs/sections/learn/index.rst index 33a657ee4..88979339b 100644 --- a/docs/sections/learn/learn.rst +++ b/docs/sections/learn/index.rst @@ -1,18 +1,17 @@ Learn ===== -If you're brand new to the Dagster system, we recommend starting with the tutorial, which will walk +If you're brand new to the Dagster system, we recommend starting with the `intro tutorial <../intro_tutorial/>`_ +, which will walk you through the most important concepts in Dagster from the beginning. The principles section articulates the fundamental principles underpinning the design of Dagster. Finally, the guides section provides deep dives into major areas of the system, motivated by common data engineering and data science workflows. .. 
toctree:: :maxdepth: 2 - tutorial/index - airline_demo principles guides/index diff --git a/docs/sections/reference/reference.rst b/docs/sections/reference/index.rst similarity index 100% rename from docs/sections/reference/reference.rst rename to docs/sections/reference/index.rst diff --git a/examples/dagster_examples/airline_demo/README.md b/examples/dagster_examples/airline_demo/README.md index ae8c83dfe..cd4df30ab 100644 --- a/examples/dagster_examples/airline_demo/README.md +++ b/examples/dagster_examples/airline_demo/README.md @@ -1,3 +1,4 @@ # Airline Demo + Check out the readme for the Airline demo in -[our docs](https://dagster.readthedocs.io/en/latest/sections/learn/airline_demo.html). +[our docs](https://dagster.readthedocs.io/en/latest/sections/demos/airline_demo.html).