Lucas Bandarkar

UCLA — Ph.D. Student

Machine Learning, Natural Language Processing

Twitter

Semantic Scholar

Google Scholar

Summary

I'm a first-year A.I. Ph.D. student in the Computer Science department at UCLA. I'm advised by Violet Peng in the PLUS Lab.

Before this, I spent over two years as a data scientist at Meta/Facebook AI working on large-scale multilingual NLP. There, I focused mainly on model evaluation, resource creation & data annotation, and global language strategy for a suite of production models (including machine translation, language identification, and text embeddings). Notably, I led the development of the Belebele dataset (GitHub, HuggingFace) and the associated LLM evaluation paper. During my undergrad at UC Berkeley, I worked in Marti Hearst's NLP lab under the mentorship of Philippe Laban.

Research Interests

multi-/cross-lingual text representations: modular & "language-agnostic" representations, language adaptation & cross-lingual transfer, tokenization & model vocabulary
multilingual evaluation: data annotation & resource creation, embeddings evaluation, translation evaluation
multilingual training data: data quality evaluation & filtering, language identification, data balancing

Applications: multilingual embeddings, LLM language adaptation, LMs in low-resource languages, language identification, machine translation

Employment

(upcoming) Research Intern, Meta AI

Research Data Scientist, Meta AI

Aug 2021 - Sep 2023

(Data Scientist from Aug 2021 - Nov 2022)

machine translation, language identification, multilingual text embeddings, multilingual optical character recognition, Arabic dialect identification

Data Scientist Intern, Meta AI

May 2020 - Aug 2020

optical character recognition

Education

(in progress) Ph.D. in Computer Science, UCLA

Sep 2023 - current

B.A. in Statistics, Data Science, UC Berkeley

Aug 2017 - May 2021