Hello Yu Cai,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are looking for ways to use Azure Video Indexer to describe video content and quickly locate specific video clips.
Azure Video Indexer (AVI) is not a trainable model in the traditional ML sense: it uses pre-trained models for object detection, speech recognition, and sentiment analysis, and you cannot fine-tune those models directly. However, for tasks like shoplifting detection you can extend its capabilities by building a custom pipeline that uses AVI for indexing and your own model for classification, wiring the two together with Azure Logic Apps and Azure OpenAI or Azure AI Computer Vision. The steps are as follows (a minimal code sketch follows the list):
- Index the video with AVI.
- Extract frames or object metadata via the AVI API.
- Send the frames to your custom model (hosted on Azure AI).
- Classify objects/actions with your model (e.g., detect shoplifting).
- Patch the AVI insights with the corrected labels via the API.
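Below is a minimal Python sketch of steps 2–5, assuming the video has already been indexed by AVI. The Video Indexer endpoint shapes follow its REST API, but the custom-model URL, its response fields (`label`, `confidence`), and the exact insights structure are placeholders you would replace with your own; please verify everything against the current documentation.

```python
# Minimal pipeline sketch (Python + requests).
# Account IDs, tokens, and the custom-model endpoint are placeholders --
# verify endpoint shapes against the current Azure Video Indexer REST docs.
import requests

LOCATION = "<region>"                 # e.g. "trial" or your AVI region
ACCOUNT_ID = "<avi-account-id>"
ACCESS_TOKEN = "<avi-access-token>"   # obtained via the AVI access-token API
VIDEO_ID = "<indexed-video-id>"
CUSTOM_MODEL_URL = "<your-model-endpoint>"   # hypothetical: your own Azure AI endpoint

BASE = f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"

# Step 2: read the insights AVI produced for the already-indexed video.
index = requests.get(
    f"{BASE}/Videos/{VIDEO_ID}/Index",
    params={"accessToken": ACCESS_TOKEN},
).json()

# Collect keyframe thumbnail IDs from the insights
# (the JSON structure may vary by API version -- check your own index output).
thumbnail_ids = [
    kf["instances"][0]["thumbnailId"]
    for shot in index["videos"][0]["insights"].get("shots", [])
    for kf in shot.get("keyFrames", [])
]

# Steps 3-4: download each keyframe and send it to your own classifier.
detections = []
for tid in thumbnail_ids:
    frame = requests.get(
        f"{BASE}/Videos/{VIDEO_ID}/Thumbnails/{tid}",
        params={"accessToken": ACCESS_TOKEN, "format": "Jpeg"},
    ).content
    result = requests.post(
        CUSTOM_MODEL_URL,   # your shoplifting/action classifier
        files={"image": ("frame.jpg", frame, "image/jpeg")},
    ).json()
    if result.get("label") == "shoplifting":   # label name is an assumption
        detections.append({"thumbnailId": tid, "confidence": result.get("confidence")})

# Step 5: feed the corrected labels back, e.g. via the Update Video Index (PATCH)
# API, or store them alongside the AVI insights in your own database.
print(detections)
```

You could host the classifier on an Azure Machine Learning endpoint or an Azure Function, and trigger the whole flow from a Logic App when indexing completes.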
These links give more detail on the steps above, including how to extend AVI with custom models, Logic Apps, and OpenAI: https://github.com/Azure-Samples/azure-video-indexer-samples/blob/master/BringYourOwn-Samples/README.MD and https://www.youtube.com/watch?v=yMqJufR9Rfs
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.
Please don't forget to close the thread by upvoting and accepting this as an answer if it was helpful.