BYOM Code Modifications

Matrice.ai supports the Bring Your Own Model (BYOM) feature, which lets users integrate any deep learning model into the platform, provided it meets the platform requirements. Your code needs to support five model actions: training, evaluation, prediction, export and deployment. Each model action takes an action ID as its input argument; the action ID is used to gather all the configuration parameters required for performing the action. Each action is divided into multiple steps, and each individual step must be reported with the correct status: OK, ERROR or SUCCESS. Please specify the well-defined metrics, frameworks and runtimes as described in the sections below. We have provided a sample codebase for training PyTorch vision classification models.
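
The per-step status reporting described above can be sketched as a small context manager. This is only an illustration: `StubTracker` and `tracked_step` are hypothetical names that are not part of the Matrice SDK, and the stub merely records the calls that a real tracker's `update_status` would receive.

```python
from contextlib import contextmanager


class StubTracker:
    """Hypothetical stand-in for ActionTracker; it only records status updates."""

    def __init__(self):
        self.updates = []

    def update_status(self, step_code, status, description):
        self.updates.append((step_code, status, description))


@contextmanager
def tracked_step(tracker, step_code, ok_message, error_message, ok_status="OK"):
    """Report ok_status when the wrapped block succeeds, ERROR (then re-raise) when it fails."""
    try:
        yield
    except Exception as exc:
        tracker.update_status(step_code, "ERROR", f"{error_message}: {exc}")
        raise
    tracker.update_status(step_code, ok_status, ok_message)


tracker = StubTracker()
with tracked_step(tracker, "MDL_TRN_DTL", "Training dataset is loaded",
                  "Error in loading training dataset"):
    dataset = [1, 2, 3]  # placeholder for real dataset loading
```

Re-raising after the ERROR update keeps the failure visible to the caller, so the script can still exit with a non-zero code.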

train.py

Your codebase must include a train.py file which takes actionID as the input argument for training your models.

import sys
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 train.py <action_id>")
        sys.exit(1)
    action_id = sys.argv[1]
    main(action_id)

In your training function, you should pass the actionID and modify the script as suggested below to ensure successful integration into our platform:

Getting the actionTracker and job parameters

You must create an actionTracker and get all the configuration parameters for training your model using the action ID as shown below:
```
import sys
from matrice_sdk.actionTracker import ActionTracker
try:
    actionTracker = ActionTracker(action_id)
    model_config = actionTracker.get_job_params()
    actionTracker.update_status('MDL_TRN_ACK', 'OK', 'Model training has been acknowledged')
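except Exception as e:
    # Report the failure and exit with a non-zero code so it is not mistaken for success
    print(f"Error initializing ActionTracker: {str(e)}")
    sys.exit(1)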
```
Once you have the actionTracker and model_config, you must use them to update the action status, get the required parameters and use them for training your model.

Loading dataset and saving the class mapping

We will prepare the dataset for model training in one of the standard formats supported by our platform. For classification, the dataset will be prepared in ImageNet folder structure and format. For object detection, we use MSCOCO or YOLO format based on the input format required by the model. You can get the dataset path using the model_config to load the dataset as shown below:
```
try:
    dataset_path = model_config['dataset_path']
    train_loader, val_loader, test_loader = load_dataset(dataset_path)
    index_to_labels = {str(idx): str(label) for idx, label in enumerate(train_loader.dataset.classes)}
    actionTracker.add_index_to_category(index_to_labels)
    actionTracker.update_status('MDL_TRN_DTL', 'OK', 'Training dataset is loaded')
except:
    actionTracker.update_status('MDL_TRN_DTL', 'ERROR', 'Error in loading training dataset')
```
Once the dataset is loaded, you must update the index to label mapping for later use during prediction and evaluation.
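
As a minimal sketch of the mapping built in the snippet above, the helper below turns a list of class names into the string-keyed dictionary passed to `add_index_to_category`. The function name `build_index_to_labels` and the example class names are assumptions made for illustration; the sorting mirrors how torchvision's `ImageFolder` orders class folders.

```python
def build_index_to_labels(class_names):
    """Map each class index (as a string) to its class name (as a string)."""
    return {str(idx): str(label) for idx, label in enumerate(class_names)}


# ImageFolder sorts class folder names, so the example sorts them as well.
class_names = sorted(["dog", "cat", "bird"])
index_to_labels = build_index_to_labels(class_names)

# At prediction time a predicted index is looked up through its string key.
predicted_index = 2
predicted_label = index_to_labels[str(predicted_index)]
```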

Creating model from scratch or checkpoint

Create your model from the correct checkpoint if one exists in the actionTracker; otherwise create it from scratch using the model_key in the model_config. The last layer must be modified to match the number of classes in the dataset. Please update your code accordingly, similar to the example below:

try:
    checkpoint_path, pretrained = actionTracker.get_checkpoint_path(model_config)
    if checkpoint_path:
        model = YOLO(checkpoint_path) # Load from checkpoint
    else:
        if pretrained:
            model = YOLO(model_config['model_key'] + '.pt') # Load the model from pretrained weights
        else:
            model = YOLO(model_config['model_key'] + '.yaml') # Load the model without pretrained weights
    actionTracker.update_status('MDL_TRN_MDL', 'OK', 'Model has been loaded')
except:
    actionTracker.update_status('MDL_TRN_MDL', 'ERROR', 'Error in loading model')

Note that the checkpoint path, if provided, points to a previously trained model. If pretrained is true, you must load the default pretrained model from your codebase automatically; pretrained models are generally trained on large standard datasets such as ImageNet for classification and MSCOCO for detection.
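
The precedence described above (checkpoint first, then pretrained weights, then scratch) can be isolated in a small pure function, which makes it easy to test without loading a model. `resolve_model_source` is a hypothetical helper, and the `.pt` and `.yaml` suffixes simply follow the YOLO snippet in this section.

```python
def resolve_model_source(checkpoint_path, pretrained, model_key):
    """Return (source, mode) using checkpoint > pretrained > scratch precedence."""
    if checkpoint_path:
        return checkpoint_path, "checkpoint"
    if pretrained:
        return f"{model_key}.pt", "pretrained"
    return f"{model_key}.yaml", "scratch"


# Example model key chosen for illustration only.
source, mode = resolve_model_source(None, True, "yolov8n")
```

The returned source can then be passed to whichever model constructor your codebase uses.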

Model Training and Epoch Logging

You must create the training method using the parameters present in the model_config. Please use the correct optimizer, learning_rate and other training configuration, similar to the example below:

try:
    optimizer = model_config['optimizer']
    learning_rate = model_config['learning_rate']
    optimizer = setup_optimizer(model, optimizer, learning_rate)
    ...
    actionTracker.update_status('MDL_TRN_STR', 'OK', 'Model training is starting')
except:
    actionTracker.update_status('MDL_TRN_STR', 'ERROR', 'Error in setting up model training')

Once model training has started, you must log the results of each epoch using valid metrics, similar to the example below:

try:
    epochDetails = [
        {"splitType": "train", "metricName": "loss", "metricValue": loss_train},
        {"splitType": "train", "metricName": "acc@1", "metricValue": acc1_train},
        {"splitType": "train", "metricName": "acc@5", "metricValue": acc5_train},
        {"splitType": "val", "metricName": "loss", "metricValue": loss_val},
        {"splitType": "val", "metricName": "acc@1", "metricValue": acc1_val},
        {"splitType": "val", "metricName": "acc@5", "metricValue": acc5_val},
    ]
    actionTracker.log_epoch_results(epoch, epochDetails)
except:
    actionTracker.update_status('MDL_TRN_EPOCH', 'ERROR', 'Error in logging training epoch details')

Please note that the epoch details are a list of entries, where each entry contains the splitType, metricName and metricValue.
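
To avoid writing every entry by hand, the list can be generated from a nested dictionary of metrics. The sketch below assumes a hypothetical helper named `build_epoch_details`; the entry keys are the ones described above, and converting values to `float` is an assumption intended to keep tensor-like scalars serializable.

```python
def build_epoch_details(metrics_by_split):
    """Flatten {split: {metric: value}} into a list of epoch detail entries."""
    details = []
    for split_type, metrics in metrics_by_split.items():
        for metric_name, metric_value in metrics.items():
            details.append({
                "splitType": split_type,
                "metricName": metric_name,
                "metricValue": float(metric_value),
            })
    return details


epoch_details = build_epoch_details({
    "train": {"loss": 0.52, "acc@1": 81.0},
    "val": {"loss": 0.61, "acc@1": 78.5},
})
```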

Saving the best model

Whenever your model reaches its best result so far, you must save the model as well as its complete state using the actionTracker. The model itself is used for evaluation, export and deployment, while the complete model state is required if you want to use it later as a checkpoint for fine-tuning. Please save the necessary models as shown below:

try:
    ## For exporting, evaluation and deployment
    torch.save(best_model, 'model_best.pt')
    actionTracker.upload_checkpoint('model_best.pt')
    actionTracker.update_status('MDL_TRN_CMPL', 'OK', 'Model Training is completed')
    actionTracker.update_status('MDL_TRN_BMS', 'OK', 'Best model saved')
except:
    actionTracker.update_status('MDL_TRN_BMS', 'ERROR', 'Error in saving the best model')
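
Deciding when the model is "best" is left to your training loop. One possible sketch, assuming a single validation metric is compared across epochs, is shown below; `BestModelKeeper` is a hypothetical helper and the plain dictionary stands in for a real model state.

```python
import copy


class BestModelKeeper:
    """Remember the best metric value seen so far and a copy of the matching state."""

    def __init__(self, higher_is_better=True):
        self.higher_is_better = higher_is_better
        self.best_value = None
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, value, state):
        """Return True and store a deep copy of state when value improves on the best."""
        if self.best_value is None:
            improved = True
        elif self.higher_is_better:
            improved = value > self.best_value
        else:
            improved = value < self.best_value
        if improved:
            self.best_value = value
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)
        return improved


keeper = BestModelKeeper(higher_is_better=True)
for epoch, val_acc in enumerate([0.50, 0.70, 0.60]):
    state = {"epoch": epoch, "weights": [val_acc]}  # placeholder for a model state
    keeper.update(epoch, val_acc, state)
```

When `update` returns True, that is the point at which the save and upload calls shown above would run.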

Running Evaluation using Best Model

Once the model has finished training for the required number of epochs, you must evaluate the best model on both the test and validation sets and save the results. You can follow something similar to the example below:

import os
from eval import get_metrics
try:
    payload = []
    ## Run on validation set
    if os.path.exists(valdir):
        payload += get_metrics('val', val_loader, best_model, index_to_labels)
    ## Run on test set
    if os.path.exists(testdir):
        payload += get_metrics('test', test_loader, best_model, index_to_labels)

    actionTracker.save_evaluation_results(payload)
    actionTracker.update_status('MDL_TRN_SUCCESS', 'OK', 'Model training is completed and best model is saved successfully')
    actionTracker.update_status('MDL_TRN_EVL', 'SUCCESS', 'Model evaluation is completed')
except:
    actionTracker.update_status('MDL_TRN_EVL', 'ERROR', 'Error in evaluation using the best model')

Alternatively, you can let the actionTracker object compute the performance metrics for your model.

Make sure the outputs and targets have the following structure:

# outputs
import torch
import torchvision.models as models
# with project_type='classification'
classification_outputs = models.resnet18()(images)
# with project_type='detection' (in eval mode the model returns a list of dicts)
detection_outputs = models.detection.fasterrcnn_resnet50_fpn().eval()(images)

# in detection, outputs and targets should each be a list of dicts with 'boxes'

targets = [{'boxes': torch.empty((0, 4)), 'labels': torch.empty((0,), dtype=torch.int64)}]
outputs = [{'boxes': torch.empty((0, 4)), 'labels': torch.empty((0,), dtype=torch.int64), 'scores': torch.empty((0,))}]

# and the targets are the labels from the training loop

# Full example
all_predictions = []
all_outputs = []
all_targets = []

with torch.no_grad():
    for images, target in loader:
        images = images.to(device)
        target = target.to(device)

        output = model(images)
        predictions = torch.argmax(output, dim=1)

        all_predictions.append(predictions)
        all_outputs.append(output)
        all_targets.append(target)

all_outputs = torch.cat(all_outputs, dim=0)
all_targets = torch.cat(all_targets, dim=0)
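
For detection projects it can help to check the list-of-dicts layout described above before handing data to the tracker. The following is a minimal sketch under stated assumptions: `check_detection_batch` is a hypothetical helper, the required keys are taken from the structure shown earlier, and plain Python lists are used in place of tensors so the example runs without PyTorch (`len()` works the same way on tensors with at least one dimension).

```python
OUTPUT_KEYS = ("boxes", "labels", "scores")
TARGET_KEYS = ("boxes", "labels")


def _check_entry(kind, index, entry, required_keys):
    """Verify that one per-image dict has all keys and equally long values."""
    missing = [key for key in required_keys if key not in entry]
    if missing:
        raise ValueError(f"{kind}[{index}] is missing keys: {missing}")
    lengths = {key: len(entry[key]) for key in required_keys}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"{kind}[{index}] has mismatched lengths: {lengths}")


def check_detection_batch(outputs, targets):
    """Raise ValueError if outputs/targets do not follow the expected layout."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets need one entry per image")
    for index, (output, target) in enumerate(zip(outputs, targets)):
        _check_entry("outputs", index, output, OUTPUT_KEYS)
        _check_entry("targets", index, target, TARGET_KEYS)


outputs = [{"boxes": [[10, 20, 50, 80]], "labels": [3], "scores": [0.9]}]
targets = [{"boxes": [[12, 18, 48, 82]], "labels": [3]}]
check_detection_batch(outputs, targets)  # passes silently for a valid batch
```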

Calculate The Metrics with the actionTracker

metrics = actionTracker.calculate_metrics(split_type='val', outputs=model_outputs, targets=model_targets, project_type='classification')
actionTracker.save_evaluation_results(metrics)

Saving some inference results (Optional)

You can save the inference results of your model by passing the following arguments:

    images: torch tensor [batch_size, channels, height, width]
    outputs: model output in the structure mentioned above in actionTracker.calculate_metrics
    targets: ground truth labels in the same structure as outputs
    split_type: "val" or "test"
    project_type: "classification" or "detection"

    actionTracker.store_inference_results(images, outputs, targets, split_type, project_type)

Or you can save them while calculating the metrics with the actionTracker by passing the images:

actionTracker.calculate_metrics(split_type='val', outputs=model_outputs, targets=model_targets, project_type='classification', images = images)

eval.py

Model Evaluation

Your codebase must include an eval.py file which takes action_status_id as the input argument for evaluating your models.

import sys
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 eval.py <action_status_id>")
        sys.exit(1)
    action_status_id = sys.argv[1]
    main(action_status_id)

In your evaluation function, you should pass the actionID and modify the script as suggested below to ensure successful integration into our platform.

Initializing ActionTracker and Starting Model Evaluation

You must create an ActionTracker and acknowledge the start of the model evaluation process using the actionID as shown below:

from matrice_sdk.actionTracker import ActionTracker
try:
    actionTracker = ActionTracker(action_id)
    actionTracker.update_status('MDL_EVL_ACK', 'OK', 'Model Evaluation has been acknowledged')
except Exception as e:
    print(f"Error initializing ActionTracker: {str(e)}")
    sys.exit(1)

Loading Test Data

You must load the test dataset using the configuration parameters from model_config. Update the action status accordingly:

from matrice_sdk.actionTracker import ActionTracker
from python_common.services.utils import log_error   
try:
    actionTracker.model_config.data = f"workspace/{actionTracker.model_config['dataset_path']}/images"
    val_loader, test_loader = load_data(actionTracker.model_config) 
    actionTracker.update_status('MDL_EVL_DTL', 'OK', 'Testing dataset is loaded')  
except Exception as e:
    actionTracker.update_status('MDL_EVL_ERR', 'ERROR', f'Error in loading dataset: {str(e)}')
    log_error(__file__, 'ml_pytorch_vision_classification/main', f'Error updating status to MDL_EVL_DTL: {str(e)}')
    print(f"Error updating status to MDL_EVL_DTL: {str(e)}")
    sys.exit(1)

Loading the Model

You must load the model from the specified path and set parameters up for evaluation:

try:
    actionTracker.download_model('model.pt')
    model = torch.load('model.pt', map_location='cpu')
    actionTracker.update_status('MDL_EVL_MDL', 'OK', 'Successfully loaded model for evaluation')
    actionTracker.model_config.batch_size = 32
    actionTracker.model_config.workers = 4
    device = update_compute(model)
    criterion = nn.CrossEntropyLoss().to(device)
    actionTracker.update_status('MDL_EVL_STR', 'OK', 'Model Evaluation has started')
except Exception as e:
    actionTracker.update_status('MDL_EVL_ERR', 'ERROR', f'Error in starting Evaluation: {str(e)}')
    log_error(__file__, 'ml_pytorch_vision_classification/main', f'Error updating status to MDL_EVL_STR: {str(e)}')
    print(f"Error updating status to MDL_EVL_STR: {str(e)}")
    sys.exit(1)

Evaluating on Test Dataset

You must evaluate the model on both validation and test datasets and save the evaluation results:

try:
    index_to_labels = actionTracker.get_index_to_category()
    payload = []

    if 'val' in actionTracker.model_config.split_types and os.path.exists(os.path.join(actionTracker.model_config.data, 'val')):
        payload += get_metrics('val', val_loader, model, index_to_labels)

    if 'test' in actionTracker.model_config.split_types and os.path.exists(os.path.join(actionTracker.model_config.data, 'test')):
        payload += get_metrics('test', test_loader, model, index_to_labels)

    actionTracker.save_evaluation_results(payload)
    actionTracker.update_status('MDL_EVL_CMPL', 'SUCCESS', 'Model Evaluation is completed')

except Exception as e:
    actionTracker.update_status('MDL_EVL_ERR', 'ERROR', f'Error in completing Evaluation: {str(e)}')
    log_error(__file__, 'ml_pytorch_vision_classification/main', f'Error updating status to MDL_EVL_CMPL: {str(e)}')
    print(f"Error updating status to MDL_EVL_CMPL: {str(e)}")
    sys.exit(1)
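
The two split checks above follow the same pattern, so they could be factored into one helper. This is a sketch only: `splits_to_evaluate` is a hypothetical function, and the temporary directory stands in for the dataset folder referenced by the model configuration.

```python
import os
import tempfile


def splits_to_evaluate(data_dir, split_types, candidates=("val", "test")):
    """Return the candidate splits that are requested and present on disk."""
    return [
        split for split in candidates
        if split in split_types and os.path.exists(os.path.join(data_dir, split))
    ]


# Demonstration with a temporary dataset folder that only contains 'val'.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "val"))
    found = splits_to_evaluate(data_dir, ["train", "val", "test"])
```

Each returned split name can then be paired with its data loader before calling the metric function.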

Alternatively, you can let the actionTracker object compute the performance metrics for your model:

metrics = actionTracker.calculate_metrics(split_type='val', outputs=model_outputs, targets=model_targets, project_type='classification')
actionTracker.save_evaluation_results(metrics)

export.py

The purpose of the export.py script is to export a trained YOLO model to different formats specified by the user. The script downloads the model, configures export parameters, performs the export process, and uploads the exported files.

Export Status Table

Status	Code	Description
OK	MDL_EXP_ACK	Model Export Acknowledged
OK	MDL_EXP_STR	Model Export Started
SUCCESS	MDL_EXP_CMPL	Model Export Completed
ERROR	MDL_EXP_ERR	Error in model export
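
One way to keep the codes, statuses and descriptions from this table consistent across the script is to store them in a single lookup. The names `EXPORT_STATUS` and `status_args` below are hypothetical; the values are copied from the table.

```python
# Code -> (status, description), copied from the export status table.
EXPORT_STATUS = {
    "MDL_EXP_ACK": ("OK", "Model Export Acknowledged"),
    "MDL_EXP_STR": ("OK", "Model Export Started"),
    "MDL_EXP_CMPL": ("SUCCESS", "Model Export Completed"),
    "MDL_EXP_ERR": ("ERROR", "Error in model export"),
}


def status_args(code):
    """Return (code, status, description) in the order update_status expects."""
    status, description = EXPORT_STATUS[code]
    return code, status, description


# Intended usage with a tracker: actionTracker.update_status(*status_args("MDL_EXP_ACK"))
ack_args = status_args("MDL_EXP_ACK")
```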

Export Workflow

Initialization:
- Initialize ActionTracker to manage and track model export actions.
- Update the status to indicate that model export has been acknowledged.
- Status Code: MDL_EXP_ACK - Model Export Acknowledged.
- Usage:
```
actionTracker = ActionTracker(action_id)
actionTracker.update_status('MDL_EXP_ACK', 'OK', 'Model Export Acknowledged')
```
Download Model:
- Download the model to be exported with the following call.
- Usage:
```
actionTracker.download_model("yolo.pt")
```

Load Model:

Load the YOLO model using the downloaded file.

Usage:

try:
    model = YOLO("yolo.pt")
    actionTracker.update_status('MDL_EXP_MDL', 'OK', 'Successfully loaded model for export')
except Exception as e:
    # Report the failure with the export error code from the status table
    actionTracker.update_status('MDL_EXP_ERR', 'ERROR', f'Error in loading model: {str(e)}')
    log_error(__file__, 'ml_pytorch_vision_classification/main', f'Error updating status to MDL_EXP_MDL: {str(e)}')
    print(f"Error updating status to MDL_EXP_MDL: {str(e)}")
    sys.exit(1)

Start Export:
- Update the status to indicate that model export has started.
- Status Code: MDL_EXP_STR - Model Export Started.
- Usage:
```
actionTracker.update_status('MDL_EXP_STR', 'OK', 'Model Export Started')
```
Perform and upload Export:
- Export the model to the specified formats with the specified options.
- Upload the exported files to the specified destination.
- Usage:
```
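# export() returns the path of the exported file (named after the weights file, e.g. yolo.onnx)
exported_path = model.export(format='onnx')
actionTracker.upload_checkpoint(exported_path)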
```
Complete Export:
- Update the status to indicate that model export is completed successfully or report any errors if they occur during export.
- Status Code: MDL_EXP_CMPL - Model Export Completed.
- Usage:
```
actionTracker.update_status('MDL_EXP_CMPL', 'SUCCESS', 'Model Export Completed')
```
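
Put together, the workflow steps can be sketched as a single function with one error path. Everything named here is an assumption made for illustration: `StubTracker` and `StubModel` only record or simulate calls so the example runs on its own, `run_export` and its `load_model` parameter are hypothetical, and the stub's `export` returning a file name is a choice of this sketch.

```python
import os


class StubTracker:
    """Hypothetical stand-in for ActionTracker that records every call."""

    def __init__(self):
        self.calls = []

    def update_status(self, code, status, description):
        self.calls.append(("update_status", code, status, description))

    def download_model(self, path):
        self.calls.append(("download_model", path))

    def upload_checkpoint(self, path):
        self.calls.append(("upload_checkpoint", path))


class StubModel:
    """Hypothetical model whose export() only returns the exported file name."""

    def __init__(self, weights):
        self.weights = weights

    def export(self, format):
        return os.path.splitext(self.weights)[0] + "." + format


def run_export(tracker, load_model, weights_name="yolo.pt", export_format="onnx"):
    """Run acknowledge, download, load, export, upload; report ERROR on failure."""
    try:
        tracker.update_status("MDL_EXP_ACK", "OK", "Model Export Acknowledged")
        tracker.download_model(weights_name)
        model = load_model(weights_name)
        tracker.update_status("MDL_EXP_MDL", "OK", "Successfully loaded model for export")
        tracker.update_status("MDL_EXP_STR", "OK", "Model Export Started")
        exported_path = model.export(format=export_format)
        tracker.upload_checkpoint(exported_path)
        tracker.update_status("MDL_EXP_CMPL", "SUCCESS", "Model Export Completed")
        return exported_path
    except Exception as exc:
        tracker.update_status("MDL_EXP_ERR", "ERROR", f"Error in model export: {exc}")
        raise


tracker = StubTracker()
exported = run_export(tracker, StubModel)
```

Passing the tracker and the model loader as arguments keeps the control flow testable without the platform or real weights.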

Running the Script

To run the export script, use the following command, replacing <action_status_id> with the appropriate action status ID.

python3 export.py <action_status_id>

deploy.py

Model Deployment

The purpose of the deploy.py script is to deploy a trained YOLO model as a web service. The script uses the Matrice SDK to handle the deployment process, including loading the model, defining prediction logic, and starting the server.

Your codebase must include a deploy.py file which takes action_status_id as the input argument for deploying your models.

import sys
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 deploy.py <action_status_id>")
        sys.exit(1)
    main(sys.argv[1])

In your deployment function, you should pass the actionID and modify the script as suggested below to ensure successful integration into our platform.

Deploying the Model

Create a script that uses the MatriceDeploy object from the Matrice SDK and calls start_server() to deploy the model:

def load_model(actionTracker):
    actionTracker.download_model('model.pt')
    model = torch.load('model.pt', map_location='cpu')
    return model

def predict(model, image_bytes):
    pass

try:
    from matrice.deploy import MatriceDeploy
    actionTracker.update_status('MDL_DPY_ACK', 'OK', 'Model Deployment has been acknowledged')
    deployment = MatriceDeploy(load_model, predict, action_id)
    actionTracker.update_status('MDL_DPY_MDL', 'OK', 'Successfully loaded model for deployment')
    deployment.start_server()
    actionTracker.update_status('MDL_DPY_STR', 'OK', 'Model Deployment started')
except Exception as e:
    actionTracker.update_status('MDL_DPY_ERR', 'ERROR', 'Error in model deployment : ' + str(e))
    sys.exit(1)
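
The `predict` function is left empty above because its body depends on your model. As a hedged illustration of its final step for a classification model, the sketch below converts raw scores into labelled confidences. The function name `format_prediction`, the response keys `category` and `confidence`, and the example labels are all assumptions; the response format the platform expects is not specified in this document.

```python
import math


def format_prediction(logits, index_to_labels, top_k=1):
    """Turn raw class scores into the top_k labelled softmax confidences."""
    highest = max(logits)
    exps = [math.exp(score - highest) for score in logits]  # numerically stable softmax
    total = sum(exps)
    probabilities = [value / total for value in exps]
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [
        {"category": index_to_labels[str(i)], "confidence": probabilities[i]}
        for i in ranked[:top_k]
    ]


# Hypothetical scores for one image and a three-class mapping.
labels = {"0": "cat", "1": "dog", "2": "bird"}
result = format_prediction([0.1, 2.0, 0.3], labels, top_k=2)
```

In a real `predict`, the scores would come from decoding `image_bytes`, preprocessing the image and running the model loaded by `load_model`.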