Microsoft Cognitive Services <--> snap!(?)

I did interface Snap! with image recognition services from Microsoft and Google (and originally from IBM, but they changed their API and broke things). You can find the guide to using this at Adding image recognition to programs.

But note that today one can do a good deal locally in the browser, including image recognition and object detection. See Adding machine learning models to programs for Snap! interfaces to this. The cloud services from Microsoft and Google do provide much more detailed information than the on-device models, but running locally avoids the hassle of API keys, preserves privacy, and improves responsiveness.

Regarding speech recognition, there is also a browser API that Chrome and Edge support (and that the Firefox folks have been working on for years). See Adding listening to programs.
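
For anyone curious what that browser speech recognition API looks like underneath, here is a rough sketch in TypeScript. It is not the code behind the Snap! blocks, just an illustration under some assumptions: the helper names findSpeechRecognition and listenOnce are made up for this example, the small interface is declared by hand rather than taken from any typings, and it assumes the constructor is exposed either as SpeechRecognition or under the webkit prefix that Chromium-based browsers have used.

```typescript
// Hand-written minimal shape of the parts of the browser speech API used here.
// Declared locally (an assumption for this sketch) so nothing depends on
// a particular version of the DOM typings.
interface RecognitionLike {
  lang: string;
  interimResults: boolean;
  onresult:
    | ((event: { results: ArrayLike<ArrayLike<{ transcript: string }>> }) => void)
    | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
}
type RecognitionCtor = new () => RecognitionLike;

// Hypothetical helper: look for the constructor under the standard name first,
// then under the webkit prefix. Returns undefined when neither is present.
function findSpeechRecognition(
  scope: Record<string, unknown>
): RecognitionCtor | undefined {
  const ctor = scope["SpeechRecognition"] || scope["webkitSpeechRecognition"];
  return typeof ctor === "function" ? (ctor as RecognitionCtor) : undefined;
}

// Hypothetical helper: listen for one utterance and resolve with the top transcript.
function listenOnce(
  scope: Record<string, unknown>,
  lang: string = "en-US"
): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const Ctor = findSpeechRecognition(scope);
    if (!Ctor) {
      reject(new Error("Speech recognition is not available in this browser"));
      return;
    }
    const recognition = new Ctor();
    recognition.lang = lang;
    recognition.interimResults = false; // only report final results
    recognition.onresult = (event) => resolve(event.results[0][0].transcript);
    recognition.onerror = (event) => reject(new Error(event.error));
    recognition.start();
  });
}
```

In a browser page one would call something like listenOnce(window as unknown as Record<string, unknown>).then(console.log). Passing the scope in as a parameter is only a design choice for this sketch: it lets the feature check fail gracefully where the API is missing instead of throwing.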