Preset schema

The following schema describes the structure of a preset.yaml file that is compatible with the DASL workspace:
$schema: "https://json-schema.org/draft/2020-12/schema"
type: object
properties:
  name:
    description: >
      The name of the preset. Preset names must be unique and should follow established patterns such as
      "${source}_${sourceType}_${optionalInfo}". User defined presets should begin with "internal_".
    type: string
  author:
    description: >
      The person or organization that authored this preset.
    type: string
  description:
    description: >
      A description of the preset that will be displayed in the UI. The included content should be a brief and may
      include information about the preset's purpose, the data it is designed to work with, and any other relevant
      information.
    type: string
  title:
    description: >
      A formatted, human-readable title for the preset. This text will be displayed in the UI.
    type: string
  iconURL:
    description: >
      A URL to an icon representing the preset. This icon will be displayed in the UI.
    type: string
    format: uri
  autoloader:
    description: >
      The databricks autoloader configuration for this preset. This configuration will be used to load data from the source
      during the bronze stage of the data pipeline. The fields available mirror those found in the DataSourceSpec
      except for the location field which is not available here.
    type: object
    properties:
      format:
        type: string
        description: json | parquet | csv | kafka | txt | cloudFiles
      schemaFile:
        type: string
        description: An optional file containing the schema of the data source.
      schema:
        type: string
        description: An optional string representing the schema of the data source.
      multiline:
        type: boolean
        description: >
          A flag indicating whether the JSON records span multiple lines. Only applies if the format is set to
          'json'.
      cloudFiles:
        type: object
        description: "The configuration used by the autoloader if format is 'cloudFiles'."
        properties:
          schemaHintsFile:
            type: string
          schemaHints:
            type: string
    required:
      - format
  silver:
    type: object
    properties:
      preTransform:
        description: >
          A list of pretransform definitions that will be used in later stages. Each item will include a name of the
          pretransform (used for dataframe variable naming) as well as the FieldSpecs that define the column schema.
          Other typical properties such as filter, postFilter, and FieldUtils may be included as well.
        type: array
        items:
          type: object
          properties:
            name:
              description: >
                The name of this pretransformation definition. This is used to identify the data transformation
                (dataframe) for later use.
              type: string
            filter:
              description: >
                A SQL filter to apply to the data at the beginning of a data transformation stage.
              type: string
            postFilter:
              description: >
                A SQL filter to apply at the end of a data transformation stage.
              type: string
            fields:
              description: >
                A list of FieldSpecs that define the column schema for this pretransformation. Each FieldSpec will
                include a name, type, and other properties that define the column schema.
              type: array
              items:
                $ref: '#/components/schemas/FieldSpec'
            utils:
              description: >
                A list of utilities for handling fields, including managing unreferenced fields and extracting fields
                from hierarchical or JSON objects.
              $ref: '#/components/schemas/FieldUtils'
      transform:
        description: >
          A list of silver transform definitions for data cleaning and processing. Each transformation item will
          include the name of the transformation which is used for variable and table naming. Other typical properties
          such as filter, postFilter, and FieldUtils may be included as well.
        type: array
        items:
          type: object
          properties:
            name:
              description: >
                The name of this silver transform definition. This is used to identify the silver table for writes
                as well as the data transformation variable (dataframe) for later use in the generated notebook.
              type: string
            filter:
              description: >
                A SQL filter to apply to the data at the beginning of a data transformation stage.
              type: string
            postFilter:
              description: >
                A SQL filter to apply at the end of a data transformation stage.
              type: string
            fields:
              description: >
                A list of FieldSpecs that define the column schema for this data transformation. Each FieldSpec will
                include a name, type, and other properties that define the column schema.
              type: array
              items:
                $ref: '#/components/schemas/FieldSpec'
            utils:
              description: >
                A list of utilities for handling fields, including managing unreferenced fields and extracting fields
                from hierarchical or JSON objects.
              $ref: '#/components/schemas/FieldUtils'
  gold:
    description: >
      The gold transform configuration for this preset. This configuration will be used to transform the silver data
      into the gold OCSF tables. Note that you can have duplicate names for array objects which indicates the same
      destination table but different source (silver) tables.
    type: array
    items:
      type: object
      properties:
        name:
          description: >
            The name of this gold transform definition used to identify the gold table for writes.
          type: string
        input:
          description: >
            The silver transform data used as input for this gold table definition.
          type: string
        filter:
          description: >
            A SQL filter to apply to the data at the beginning of this stage.
          type: string
        postFilter:
          description: >
            A SQL filter to apply at the end of this stage.
          type: string
        fields:
          description: >
            A list of FieldSpecs that define the column schema for the gold stage data. Each FieldSpec will
            include a name, type, and other properties that define the column schema.
          type: array
          items:
            $ref: '#/components/schemas/FieldSpec'

components:
  required:
    - name
    - author
    - description
    - title
    - iconURL
  schemas:
    FieldSpec:
      description: >
        A FieldSpec object is used to define a field to add to a dataset (dataframe). This field can be derived from an existing field,
        an expression, from a literal value, from a join, etc. FieldSpec objects are found wherever the user needs to transform data
        including in datasources, custom notebooks, and transforms.
      type: object
      properties:
        name:
          type: string
        comment:
          type: string
          description: The comment to apply to the field.
        assert:
          type: array
          description: >
            A list of SQL expressions that must evaluate to true for every processed row.
            The name of the column can be used in the SQL expression. If the assertion is
            false, an operational alert is raised using 'message' for each row.
          items:
            type: object
            properties:
              expr:
                type: string
              message:
                description: >
                  The message to include in the operational alert if the assertion fails. (E.g., "The user email is null").
                type: string
        from:
          type: string
          description: This field obtains its value from the source column of this names.
        alias:
          type: string
          description: This field obtains its value from the destination (transformed) column of this name.
        expr:
          type: string
          description: This field obtains its value from the given SQL expression.
        literal:
          type: string
          description: This field obtains its value from the given literal string. For other data types, use expr.
        join:
          type: object
          properties:
            withTable:
              type: string
              description: The table to join to.
            withCSV:
              type: object
              properties:
                path:
                  type: string
                  description: The path to the CSV file.
            lhs:
              type: string
              description: The column in the source dataframe to join on.
            rhs:
              type: string
              description: The column in withTable (or withCSV) to join on.
            select:
              type: string
              description: A SQL expression to create the new field from the joined dataset.
    FieldUtils:
    description: >
      Defines utilities for handling fields, including managing unreferenced fields and extracting fields from
      hierarchical or JSON objects.
    type: object
    properties:
      unreferencedColumns:
        type: object
        properties:
          preserve:
            description: >
              Indicates whether columns not referenced in the FieldSpecs should be preserved in the output. This only
              applies to the name in the "from" attribute.
            type: boolean
          embedColumn:
            description: >
              Specifies a name for a new column to contain all unreferenced fields as a single structured object.
              Only applies if preserve is true.
            type: string
          embedColumnType:
            description: >
              Specifies the type of structured object that that unreferenced fields will be contained in.
              Supported values are: json | struct | variant. Only applies if embedColumn is set.
            type: string
          omitColumns:
            description: >
              Lists columns to exclude from the output (either preserved as columns or embedded). Useful for retaining
              all columns except the specified ones.
            type: array
            items:
              type: string
          duplicatePrefix:
            description: >
              Adds a specified prefix to resolve ambiguous duplicate field names. This applies only to "preserved"
              columns that may be duplicative with something in the field specs.
            type: string
      jsonExtract:
        description: "TODO: this happens before unreferencedColumns"
        type: array
        items:
          type: object
          properties:
            source:
              description: >
                The name of the column containing the JSON string from which fields will be extracted.
              type: string
            omitFields:
              description: >
                Specifies high-level fields to exclude from extraction.
              type: array
              items:
                type: string
            duplicatePrefix:
              description: >
                Adds a specified prefix to resolve ambiguous duplicate field names generated during extraction.
              type: string
            embedColumn:
              description: >
                Specifies a column name to store the extracted JSON object as a structured type. If not specified,
                the object is extracted into the top-level structure.
              type: string