Learning to Detect Human-Object Interactions

Imagine, one day, you could ask Siri: "Hey Siri, show me the photos in which I was riding that amazing horse!" or "Do you remember the selfie I took when I was kissing my cute little puppy Max?"

Needless to say, virtual assistants today don't actually understand photos. All they can do is pull out the metadata associated with a photo, such as GPS location, time taken, and ambient temperature. As a result, they can only answer queries like "Show me the pictures I took in downtown Manhattan (at midnight)."


We want to help computers actually understand the content of an image. Computers trained by us would be able to locate the human beings and objects in the image and figure out what action each human is performing on each object. The queries at the beginning of this section would then indeed be possible.

To do this, we would like to ask Internet workers to mark the locations of human beings and objects in a batch of images. My role is to develop an annotation interface that helps the workers do this job faster and more accurately.
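To make the annotation task concrete, here is a minimal sketch of what a single worker annotation might look like once collected. It is not the interface's actual data format: the names `Box`, `Interaction`, and `tight_box`, the coordinates, and the action label are all hypothetical. It assumes a worker outlines a person and an object with polygons, which are then reduced to tight axis-aligned boxes and paired with an action.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


@dataclass
class Box:
    """Axis-aligned bounding box (hypothetical record type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Interaction:
    """One human-object pair plus the action linking them (hypothetical)."""
    human: Box
    obj: Box
    action: str


def tight_box(polygon: List[Point]) -> Box:
    """Return the smallest axis-aligned box containing every polygon vertex."""
    if not polygon:
        raise ValueError("polygon needs at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return Box(min(xs), min(ys), max(xs), max(ys))


# Made-up polygons a worker might draw around a rider and a horse.
rider = [(120, 40), (180, 35), (190, 160), (115, 150)]
horse = [(60, 110), (260, 100), (270, 240), (70, 250)]

record = Interaction(tight_box(rider), tight_box(horse), "ride")
print(record)
```

A record like this ties together the two things the workers are asked for (where the person and the object are) with the action a model would later need to predict.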

Interface demo (Drawing Polygons Version)
Interface demo (Drawing Tight Boxes Version)
Interface demo (Most Recent Version)
Complete development folder (Simple-amt based)

Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, Jia Deng. Learning to Detect Human-Object Interactions. (arXiv.org)
