# Model Schema

A model schema specifies the inputs and outputs of a model: its feature columns, prediction columns, and ground truth columns. A model schema has the following fields:

| Field      | Type            | Description                                | Mandatory                                                                                                            |
| ---------- | --------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| `id`       | int             | Unique identifier of the model schema      | False. If no ID is specified, a new model schema is created; otherwise the model schema with the given ID is updated |
| `model_id` | int             | ID of the model that the schema belongs to | False. If not specified, the SDK assigns the model that the user has set                                             |
| `spec`     | InferenceSchema | Detailed specification of the model schema | True                                                                                                                 |

The detailed specification is defined with the `InferenceSchema` class, which has the following fields:

| Field                     | Type                  | Description                                                                                                                         | Mandatory |
| ------------------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | --------- |
| `feature_types`           | Dict\[str, ValueType] | Mapping from feature name to feature type                                                                                           | True      |
| `model_prediction_output` | PredictionOutput      | Prediction specification, which differs between model types, e.g. `BinaryClassificationOutput`, `RegressionOutput`, `RankingOutput` | True      |
| `session_id_column`       | str                   | Name of the column that uniquely identifies a request                                                                               | True      |
| `row_id_column`           | str                   | Name of the column that uniquely identifies a row within a request                                                                  | True      |
| `tag_columns`             | Optional\[List\[str]] | List of column names that contain additional information about the prediction; treat them as metadata                               | False     |
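
To make the column roles concrete, the sketch below checks one logged record against the columns a schema names. It is plain Python and does not use the Merlin SDK: `check_record`, the Python types standing in for `ValueType`, and the sample record are all hypothetical, and the rule that every declared feature must appear in each record is an assumption made for illustration.

```python
# Hypothetical stand-ins for the schema fields; these are not Merlin SDK classes.
FEATURE_TYPES = {"featureA": float, "featureB": int, "featureC": str}
SESSION_ID_COLUMN = "session_id"
ROW_ID_COLUMN = "row_id"


def check_record(record: dict) -> list:
    """Return a list of problems found in one logged record."""
    problems = []
    # The request and row identifier columns are mandatory in the schema.
    for column in (SESSION_ID_COLUMN, ROW_ID_COLUMN):
        if column not in record:
            problems.append(f"missing column: {column}")
    # Assumption: every declared feature is present with the declared type.
    for name, expected in FEATURE_TYPES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    return problems


# "country" plays the role of an optional tag column and is not checked.
record = {"session_id": "s-1", "row_id": "r-1", "featureA": 0.2,
          "featureB": 3, "featureC": "x", "country": "ID"}
print(check_record(record))
```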

The `model_prediction_output` field has type `PredictionOutput`. It specifies the prediction generated by the model and depends on the model type. The schema currently supports three model types:

* Binary Classification
* Regression
* Ranking

Each model type has its own model prediction output specification.

## Binary Classification

The model prediction output specification for the Binary Classification type is `BinaryClassificationOutput`, which has the following fields:

| Field                     | Type  | Description                                                                                                  | Mandatory                                           |
| ------------------------- | ----- | ------------------------------------------------------------------------------------------------------------ | --------------------------------------------------- |
| `prediction_score_column` | str   | Name of the column containing the model's prediction score. The prediction score must be between 0.0 and 1.0 | True                                                |
| `actual_label_column`     | str   | Name of the column containing the actual class                                                               | False, because not every model has ground truth     |
| `positive_class_label`    | str   | Label for the positive class                                                                                 | True                                                |
| `negative_class_label`    | str   | Label for the negative class                                                                                 | True                                                |
| `score_threshold`         | float | Score threshold for a prediction to be considered the positive class                                         | False. If not specified, 0.5 is used as the default |
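
As a rough illustration of how `score_threshold` relates a prediction score to the two class labels, here is a minimal plain-Python sketch. The helper `predicted_label` is hypothetical and not part of the SDK, and treating a score equal to the threshold as positive is an assumption; the table above does not say whether the comparison is inclusive.

```python
def predicted_label(score, positive_class_label, negative_class_label,
                    score_threshold=0.5):
    """Map a prediction score to a class label (hypothetical helper)."""
    # The schema requires prediction scores to lie between 0.0 and 1.0.
    if not 0.0 <= score <= 1.0:
        raise ValueError("prediction score must be between 0.0 and 1.0")
    # Assumption: a score equal to the threshold counts as positive.
    if score >= score_threshold:
        return positive_class_label
    return negative_class_label


print(predicted_label(0.8, "complete", "non_complete", score_threshold=0.75))
```

With the default threshold of 0.5, a score of 0.6 would map to the positive label; with a threshold of 0.75 it would map to the negative label.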

## Regression

The model prediction output specification for the Regression type is `RegressionOutput`, which has the following fields:

| Field                     | Type | Description                                                | Mandatory                                       |
| ------------------------- | ---- | ---------------------------------------------------------- | ----------------------------------------------- |
| `prediction_score_column` | str  | Name of the column containing the model's prediction score | True                                            |
| `actual_score_column`     | str  | Name of the column containing the actual score             | False, because not every model has ground truth |
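
Because `actual_score_column` is optional, anything that compares predictions with ground truth has to handle its absence. The sketch below shows one way this could look in plain Python. `mean_absolute_error` and the sample rows are hypothetical and do not reflect how Merlin computes metrics.

```python
def mean_absolute_error(rows, prediction_score_column, actual_score_column=None):
    """Average absolute error over rows that carry ground truth (hypothetical helper)."""
    # Without an actual score column there is nothing to compare against.
    if actual_score_column is None:
        return None
    errors = [
        abs(row[prediction_score_column] - row[actual_score_column])
        for row in rows
        if row.get(actual_score_column) is not None  # skip rows without ground truth
    ]
    return sum(errors) / len(errors) if errors else None


rows = [
    {"score": 2.0, "actual": 3.0},
    {"score": 5.0, "actual": 4.5},
    {"score": 1.0},  # ground truth not available for this row
]
print(mean_absolute_error(rows, "score", "actual"))
```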

## Ranking

The model prediction output specification for the Ranking type is `RankingOutput`, which has the following fields:

| Field                        | Type | Description                                                         | Mandatory |
| ---------------------------- | ---- | ------------------------------------------------------------------- | --------- |
| `rank_score_column`          | str  | Name of the column containing the ranking score of the prediction   | True      |
| `prediction_group_id_column` | str  | Name of the column containing the prediction group id               | True      |
| `relevance_score_column`     | str  | Name of the column containing the relevance score of the prediction | True      |
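
The sketch below illustrates how the three ranking columns relate: rows that share a prediction group ID form one ranked list, ordered by rank score. It is plain Python with a hypothetical helper `rank_within_groups`, and it assumes that a higher rank score means a higher position, which the table above does not state.

```python
from collections import defaultdict


def rank_within_groups(rows, rank_score_column, prediction_group_id_column):
    """Group rows by prediction group and order each group by rank score."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[prediction_group_id_column]].append(row)
    # Assumption: higher rank score means the item is ranked first.
    return {
        group_id: sorted(items, key=lambda r: r[rank_score_column], reverse=True)
        for group_id, items in groups.items()
    }


rows = [
    {"group": "g1", "rank_score": 0.2, "relevance": 0},
    {"group": "g1", "rank_score": 0.9, "relevance": 1},
    {"group": "g2", "rank_score": 0.5, "relevance": 1},
]
ranked = rank_within_groups(rows, "rank_score", "group")
print([r["rank_score"] for r in ranked["g1"]])
```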

## Define model schema

From the specification above, users can create the schema for their model. Suppose a user has a binary classification model with four features:

* featureA that has float type
* featureB that has int type
* featureC that has string type
* featureD that has boolean type

The positive class is `complete`, the negative class is `non_complete`, and the threshold for the positive class is 0.75. The actual label is stored in column `target`, the prediction score in column `score`, the session ID in column `session_id`, and the row ID in column `row_id`. From that specification, users can define the model schema and pass it when creating a model version. Below is an example code snippet:

```python
import merlin
from merlin.model_schema import ModelSchema
from merlin.observability.inference import InferenceSchema, ValueType, BinaryClassificationOutput

# Describe the model's features, identifier columns, and prediction output.
model_schema = ModelSchema(spec=InferenceSchema(
    feature_types={
        "featureA": ValueType.FLOAT64,
        "featureB": ValueType.INT64,
        "featureC": ValueType.STRING,
        "featureD": ValueType.BOOLEAN
    },
    # Columns that identify a request and a row within that request.
    session_id_column="session_id",
    row_id_column="row_id",
    model_prediction_output=BinaryClassificationOutput(
        prediction_score_column="score",
        actual_label_column="target",
        positive_class_label="complete",
        negative_class_label="non_complete",
        score_threshold=0.75
    )
))

# Attach the schema when creating the model version.
with merlin.new_model_version(model_schema=model_schema) as v:
    ...
```

The snippet above defines a model schema and attaches it to a specific model version, because the schema can differ between versions.

