Here are the steps:
Segmenting Everything with SAM: We detect everything and worry about filtering later.
Filtering with CLIP: Once we have all the segmented objects, we don’t want all of them. We need to filter out the noise and keep only the relevant objects.
Adding Reasoning with a model like GPT-4o: Okay, so we’ve segmented and filtered. But what about actually understanding the result and finalising the output? That’s where a strong LLM like GPT-4o comes in.
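To make the filtering step a bit more concrete, here is a minimal sketch of the idea, not the actual pipeline code. It assumes each SAM mask has already been cropped and embedded by CLIP, and that the text prompt has been embedded too. The vectors below are toy placeholders, and the names `Segment`, `filter_segments`, and the `threshold` value are hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    # Hypothetical container: one entry per mask produced by the segmenter.
    label: str
    # Assumption: an image embedding of the cropped mask (would come from CLIP).
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_segments(
    segments: List[Segment],
    text_embedding: List[float],
    threshold: float = 0.25,  # illustrative value, to be tuned on real data
) -> List[Tuple[float, Segment]]:
    """Keep segments whose similarity to the text prompt meets the threshold,
    sorted from most to least similar."""
    scored = [(cosine_similarity(s.embedding, text_embedding), s) for s in segments]
    kept = [(score, s) for score, s in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Toy data standing in for real CLIP embeddings.
segments = [
    Segment("mask_0", [0.9, 0.1, 0.0]),
    Segment("mask_1", [0.0, 1.0, 0.0]),
    Segment("mask_2", [0.7, 0.0, 0.7]),
]
query_embedding = [1.0, 0.0, 0.0]

relevant = filter_segments(segments, query_embedding, threshold=0.5)
for score, seg in relevant:
    print(f"{seg.label}: {score:.3f}")
```

The labels and scores of the segments that survive could then be passed to the LLM as part of its prompt for the reasoning step.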
Here is what I did with SAM and CLIP. We now need to put a good LLM on top to add some reasoning.