I am a Research Scientist at Apple Machine Learning Research (MLR), where I work on building generalist interactive and embodied AI systems that enable humans to be more productive and creative. Toward that goal, I am interested in imbuing agents with the ability to perceive their surroundings, reason about them, and take actions to accomplish their goals.

I received my PhD from Georgia Tech in 2023, where I was advised by Dhruv Batra. My PhD was partially supported by the 2019 Snap Fellowship. I also collaborated with Devi Parikh (Georgia Tech), Natasha Jaques (Google Brain), Peter Anderson (Google Research), Gal Chechik (NVIDIA), Marcus Rohrbach (Facebook AI Research), and Alex Schwing (UIUC). Before my PhD, I spent a couple of years as a Research Engineer at Snap Research, where I built large-scale infrastructure for visual recognition and search and developed algorithms for low-shot instance detection.

Aside from research, I helped maintain and manage EvalAI, an AI challenge hosting platform that makes AI research more reproducible. EvalAI hosts 150+ challenges and has 300+ contributors, 2M+ annual pageviews, 1,400+ forks, 4,500+ resolved issues and merged pull requests, and 3,000+ stars on GitHub.

Publications
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

A. Szot, B. Mazoure, O. Attia, A. Timofeev, H. Agrawal, D. Hjelm, Z. Gan, Z. Kira, A. Toshev

2025

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Z. Li, K. You, H. Zhang, D. Feng, H. Agrawal, X. Li, M. P. S. Moorthy, J. Nichols, Y. Yang, Z. Gan

ICLR 2025

Grounding Multimodal Large Language Models in Actions

A. Szot, B. Mazoure, H. Agrawal, D. Hjelm, Z. Kira, A. Toshev

NeurIPS 2024

Large Language Models as Generalizable Policies for Embodied Tasks

A. Szot, M. Schwarzer, H. Agrawal, B. Mazoure, W. T. K. Metcalf, N. Mackraz, D. Hjelm, A. Toshev

ICLR 2024

Simple and Effective Synthesis of Indoor 3D Scenes

J. Y. Koh*, H. Agrawal*, D. Batra, R. Tucker, A. Waters, H. Lee, Y. Yang, J. Baldridge, P. Anderson

AAAI 2023

Housekeep: Tidying Virtual Households using Commonsense Reasoning

Y. Kant, A. Ramachandran, S. Yenamandra, I. Gilitschenski, D. Batra, A. Szot*, H. Agrawal*

ECCV 2022

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

A. Moudgil, A. Majumdar, H. Agrawal, S. Lee, D. Batra

NeurIPS 2021

Known unknowns: Learning novel concepts using reasoning-by-elimination

H. Agrawal, E. A. Meirom, Y. Atzmon, S. Mannor, G. Chechik

UAI 2021 (Long Talk)

The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

X. Zhao, H. Agrawal, D. Batra, A. Schwing

ICCV 2021

Contrast and Classify: Alternate Training for Robust VQA

Y. Kant, A. Moudgil, D. Batra, D. Parikh, H. Agrawal

ICCV 2021

Spatially Aware Multimodal Transformers for TextVQA

Y. Kant, D. Batra, P. Anderson, A. Schwing, D. Parikh, J. Lu, H. Agrawal

ECCV 2020

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

J. Aneja*, H. Agrawal*, D. Batra, A. Schwing

ICCV 2019

nocaps: novel object captioning at scale

H. Agrawal*, K. Desai*, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, P. Anderson

ICCV 2019

Sort Story: Sorting Jumbled Images and Captions into Stories

H. Agrawal*, A. Chandrasekaran*, D. Batra, D. Parikh, M. Bansal

EMNLP 2016

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

A. Das*, H. Agrawal*, C. L. Zitnick, D. Parikh, D. Batra

Computer Vision and Image Understanding (CVIU) 2017
EMNLP 2016
ICML 2016 Workshop on Visualization for Deep Learning (Best Student Paper)

Object-Proposal Evaluation Protocol is 'Gameable'

N. Chavali*, H. Agrawal*, A. Mahendru*, D. Batra

CVPR 2016 (Spotlight)

EvalAI: Towards Better Evaluation Systems for AI Agents

D. Yadav, R. Jain, H. Agrawal, P. Chattopadhyay, T. Singh, A. Jain, S. B. Singh, S. Lee, D. Batra

AI Systems Workshop (SOSP 2019)

CloudCV: Large Scale Distributed Computer Vision as a Cloud Service

H. Agrawal, C. S. Mathialagan, Y. Goyal, N. Chavali, P. Banik, A. Mohapatra, A. Osman, D. Batra

Book chapter in Mobile Cloud Visual Media Computing, pp. 265-290

Fabrik: An Online Collaborative Neural Network Editor

U. Garg, V. Prabhu, D. Yadav, R. Ramrakhya, H. Agrawal, D. Batra

AI Systems Workshop (SOSP 2019)