Harsh Agrawal
My research lies at the intersection of computer vision and natural language processing. Currently, I am a Research Engineer at Snapchat Research, working with an amazing group of people to invent new ways to make sense of the large amounts of visual data available on Snapchat. In my free time, I actively maintain the CloudCV project, which aims to make AI research more reproducible.

Research Engineer

July, 2016

Snapchat Research

Inventing the next generation of creative tools to “empower people to express themselves, live in the moment, learn about the world, and have fun together.”


M.S in Computer Engineering

May, 2016

Machine Learning and Perception Lab, Virginia Tech

Advisor: Dr. Dhruv Batra

Worked on problems at the intersection of computer vision, natural language processing, and machine learning.


Organization Administrator for CloudCV

2015, 2016

Google Summer of Code (GSoC), '15, '16

Mentored three GSoC students over the summer, who contributed to CloudCV as part of Google Summer of Code 2015.


Software Engineer Intern

June - August, 2015

Microsoft Dynamics

Worked on applying machine learning techniques to automate workflows for the sales lifecycle in Microsoft Dynamics CRM.


B.S. in Computer Engineering

June, 2014

Graduated from Delhi College of Engineering.

Worked on applications of computer vision for an Unmanned Aerial Vehicle.


Visiting Student

June, 2014

Machine Learning and Perception Lab, Virginia Tech

Advisor: Dr. Dhruv Batra

Worked on building the first prototype of CloudCV.


Visiting Student Researcher

September, 2012

Mobile and Ubiquitous Computing, IIIT-Delhi

Developed cloud-enabled localization algorithms for Android smartphones based on the cell broadcast service. Also designed and implemented an algorithm that builds mobility profiles to predict encounters between mobile-phone users, allowing them to share content locally over Bluetooth.


Computer Vision Lead

September, 2010

Unmanned Aerial Systems - Delhi Technological University

Performed autonomous extraction and segmentation of objects from aerial imagery of natural scenes for the Student UAS Competition. Developed a robust, user-friendly GUI in Qt to control camera properties and process the image feed acquired wirelessly.


Sort Story: Sorting Jumbled Images and Captions into Stories

Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing -- given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. As features, we use both text-based and image-based features, which provide complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.

Harsh Agrawal*, Arjun Chandrasekaran*, Dhruv Batra, Devi Parikh, Mohit Bansal

EMNLP 2016 (Short Paper)
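The pairwise (order) approach described in the abstract can be sketched as follows: given a model's scores for whether one image-caption pair precedes another, pick the permutation that maximizes total pairwise agreement. The scores below are made-up toy values (not outputs of the actual model), and the exhaustive search is only a stand-in for the paper's inference procedure, shown here for a 3-element story:

```python
from itertools import permutations

# Toy pairwise "order" scores: pair_prob[i][j] is a model's estimated
# probability that item i comes before item j in the story.
# (These values are illustrative, not from the actual model.)
pair_prob = [
    [0.0, 0.9, 0.8],
    [0.1, 0.0, 0.7],
    [0.2, 0.3, 0.0],
]

def sequence_score(order, probs):
    """Sum pairwise scores over every (earlier, later) pair in the order."""
    return sum(probs[a][b]
               for idx, a in enumerate(order)
               for b in order[idx + 1:])

# Exhaustively pick the permutation with the highest total pairwise score
# (feasible for short stories; real systems need smarter inference).
best = max(permutations(range(3)), key=lambda o: sequence_score(o, pair_prob))
print(best)  # (0, 1, 2)
```

With these toy scores, the ordering (0, 1, 2) wins because every "i before j" probability along it exceeds 0.5.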

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.

Abhishek Das*, Harsh Agrawal*, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

EMNLP 2016 (Short Paper)
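The quantitative comparison mentioned in the abstract uses rank-order (Spearman) correlation between attention maps. A minimal pure-Python sketch, with tiny illustrative toy maps standing in for real VQA-HAT data:

```python
def rank(values):
    """1-based ranks of values, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy 2x2 attention maps flattened to vectors (illustrative values only).
human_map = [0.10, 0.70, 0.05, 0.15]
model_map = [0.20, 0.50, 0.10, 0.20]

rho = spearman(human_map, model_map)  # high correlation (~0.95) here
```

In the paper's setting, maps would be much larger and correlations are averaged over many question-image pairs; this just shows the measurement itself.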

Object-Proposal Evaluation Protocol is 'Gameable'

Object proposals have quickly become the de facto pre-processing step in a number of vision pipelines (for object detection, object discovery, and other tasks). Their performance is usually evaluated on partially annotated datasets. In this paper, we argue that the choice of using a partially annotated dataset for evaluation of object proposals is problematic -- as we demonstrate via a thought experiment, the evaluation protocol is 'gameable', in the sense that progress under this protocol does not necessarily correspond to a "better" category-independent object proposal algorithm.

Neelima Chavali*, Harsh Agrawal*, Aroma Mahendru*, Dhruv Batra

CVPR 2016 (Spotlight)
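The 'gameability' argument hinges on how recall is computed against partial annotations: a proposer that only returns boxes for annotated objects can score perfect recall while missing everything unannotated. A toy sketch (the boxes and IoU threshold are illustrative, not the paper's actual protocol):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall(annotated_boxes, proposals, thresh=0.5):
    """Fraction of annotated boxes matched by some proposal at IoU >= thresh."""
    hits = sum(any(iou(g, p) >= thresh for p in proposals)
               for g in annotated_boxes)
    return hits / len(annotated_boxes)

# Only one of the two objects present in the image is annotated.
annotated = [(0, 0, 10, 10)]
unannotated_object = (20, 20, 30, 30)  # in the image, missing from GT

# A "gamed" proposer that regurgitates annotated regions scores perfectly,
# even though it never proposes the unannotated object.
gamed_proposals = [(0, 0, 10, 10)]
print(recall(annotated, gamed_proposals))  # 1.0
```

Since the unannotated object never enters the denominator, missing it costs nothing under this protocol, which is the gameability the paper demonstrates.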

CloudCV: Large Scale Distributed Computer Vision as a Cloud Service

We are witnessing a proliferation of massive visual data. Unfortunately, scaling existing computer vision algorithms to large datasets leaves researchers repeatedly solving the same algorithmic, logistical, and infrastructural problems. Our goal is to democratize computer vision; one should not have to be a computer vision, big data, and distributed computing expert to have access to state-of-the-art distributed computer vision algorithms. We present CloudCV, a comprehensive system that provides access to state-of-the-art distributed computer vision algorithms as a cloud service through a web interface and APIs.

Harsh Agrawal, Clint Solomon Mathialagan, Yash Goyal, Neelima Chavali, Prakriti Banik, Akrit Mohapatra, Ahmed Osman, Dhruv Batra

Book Chapter: Mobile Cloud Visual Media Computing, pp. 265-290