Manaal Faruqui
Senior Staff Research Scientist, Manager, Google Bard
mfaruqui google com
I am a research scientist at Google working on Google Bard, where I lead a team of research scientists and engineers responsible for Bard's model quality, spanning factuality (producing factual responses), instruction following, and overall management of the training mixture.
@article{faruqui-tur-2022,
title = "{Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems}",
author = "Faruqui, Manaal and Hakkani-Tür, Dilek",
journal = "Computational Linguistics",
volume = {1},
year = "2022"
}
@inproceedings{qin-etal-2021-timedial,
title = "{TimeDial: Temporal Commonsense Reasoning in Dialog}",
author = "Qin, Lianhui and Gupta, Aditya and Upadhyay, Shyam and He, Luheng and Choi, Yejin and Faruqui, Manaal",
booktitle = "Proc. of ACL",
year = "2021"
}
@inproceedings{gupta-etal-2021-disflqa,
title = "{Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering}",
author = "Gupta, Aditya and Xu, Jiacheng and Upadhyay, Shyam and Yang, Diyi and Faruqui, Manaal",
booktitle = "Findings of ACL",
year = "2021"
}
@inproceedings{parikh2020totto,
title={{ToTTo}: A Controlled Table-To-Text Generation Dataset},
author={Parikh, Ankur P and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of EMNLP},
year={2020}
}
@inproceedings{questions:aaai20,
author = {Chu, Zewei and Chen, Mingda and Chen, Jing and Wang, Miaosen and Gimpel, Kevin and Faruqui, Manaal and Si, Xiance},
title = {How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions},
booktitle = {Proc. of AAAI},
year = {2020},
}
@inproceedings{parent:acl19,
author = {Dhingra, Bhuwan and Faruqui, Manaal and Parikh, Ankur and Chang, Ming-Wei and Das, Dipanjan and Cohen, William},
title = {Handling Divergent Reference Texts when Evaluating Table-to-Text Generation},
booktitle = {Proc. of ACL},
year = {2019},
}
@inproceedings{split:emnlp18,
author = {Botha, Jan and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan},
title = {Learning To Split and Rephrase From Wikipedia Edit History},
booktitle = {Proc. of EMNLP},
year = {2018},
}
Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning.
We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million
naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger
vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating
WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above
the prior best result on the WebSplit benchmark.
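As a small illustration of the task format (the sentence pair below is invented for illustration, not drawn from WikiSplit):

example = {
    "source": "The museum, which was founded in 1971, houses more than "
              "5,000 paintings.",
    "split": [
        "The museum was founded in 1971.",
        "The museum houses more than 5,000 paintings.",
    ],
}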
@inproceedings{faruqui:emnlp18,
author = {Faruqui, Manaal and Das, Dipanjan},
title = {Identifying Well-formed Natural Language Questions},
booktitle = {Proc. of EMNLP},
year = {2018},
}
Understanding search queries is a hard problem as it involves dealing with "word salad"
text ubiquitously issued by users. However, if a query resembles a well-formed question,
a natural language processing pipeline is able to perform more accurate interpretation,
thus reducing downstream compounding errors. Hence, identifying whether or not a
query is well formed can enhance query understanding. Here, we introduce a new task
of identifying a well-formed natural language question. We construct and release a dataset
of 25,100 publicly available questions classified into well-formed and non-well-formed categories
and report an accuracy of 70.7% on the test set. We also show that our classifier can
be used to improve the performance of neural sequence-to-sequence models for generating
questions for reading comprehension.
@inproceedings{pavlick:emnlp18,
author = {Faruqui, Manaal and Pavlick, Ellie and Tenney, Ian and Das, Dipanjan},
title = {WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse},
booktitle = {Proc. of EMNLP},
year = {2018},
}
We release a corpus of atomic insertion edits: instances in which a human editor has inserted a single
contiguous span of text into an existing sentence. Our corpus is derived from Wikipedia edit history
and contains 43 million sentences across 8 different languages. We argue that the signal contained in
these edits is valuable for research in semantics and discourse, and that such signal differs from
that found in conventional language modeling corpora. We provide experimental evidence from both a
corpus linguistics and a language modeling perspective to support these claims.
@inproceedings{upadhyay:icassp18,
author = {Upadhyay, Shyam and Faruqui, Manaal and Tur, Gokhan and Hakkani-Tur, Dilek and Heck, Larry},
title = {(Almost) Zero-shot Cross-lingual Spoken Language Understanding},
booktitle = {Proc. of ICASSP},
year = {2018},
}
Spoken language understanding (SLU) is a component of goal-oriented dialogue
systems that aims to interpret users' natural language queries in the system's
semantic representation format. While current state-of-the-art SLU approaches
achieve high performance for English domains, the same is not true for other
languages. Approaches in the literature for extending SLU models and grammars
to new languages rely primarily on machine translation. This poses a challenge
in scaling to new languages, as machine translation systems may not be reliable
for several (especially low resource) languages. In this work, we examine
different approaches to train an SLU component with little supervision for two
new languages -- Hindi and Turkish, and show that with only a few hundred labeled
examples we can surpass the approaches proposed in the literature. Our experiments
show that training a model bilingually (i.e., jointly with English), enables
faster learning, in that the model requires fewer labeled instances in the target
language to generalize. Qualitative analysis shows that rare slot types benefit
the most from the bilingual training.
@inproceedings{repeval:16,
author = {Faruqui, Manaal and Tsvetkov, Yulia and Rastogi, Pushpendre and Dyer, Chris},
title = {Problems With Evaluation of Word Embeddings Using Word Similarity Tasks},
booktitle = {Proc. of the 1st Workshop on Evaluating Vector Space Representations for NLP},
year = {2016},
url = {http://arxiv.org/pdf/1605.02276v1.pdf},
}
Lacking standardized extrinsic evaluation methods for vector representations
of words, the NLP community has relied heavily on word similarity tasks as a
proxy for intrinsic evaluation of word vectors. Word similarity evaluation,
which correlates the distance between vectors with human judgments of semantic
similarity, is attractive because it is computationally inexpensive and fast.
In this paper we present several problems associated with the evaluation of
word vectors on word similarity datasets, and summarize existing solutions.
Our study suggests that the use of word similarity tasks for evaluation of
word vectors is not sustainable and calls for further research on evaluation
methods.
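For reference, the evaluation protocol under discussion is typically implemented along these lines (a sketch with placeholder inputs, not tied to any particular benchmark):

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # cosine similarity between two word vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_eval(vectors, pairs):
    # vectors: dict word -> np.ndarray; pairs: list of (word1, word2, human_score)
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(human)
    # Spearman rank correlation between model similarities and human judgments
    return spearmanr(model_scores, human_scores).correlation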
@phdthesis{faruquithesis,
author = {Faruqui, Manaal},
title = {Diverse Context for Learning Word Representations},
school = {Carnegie Mellon University},
year = 2016,
}
@inproceedings{bicompare:16,
author = {Upadhyay, Shyam and Faruqui, Manaal and Dyer, Chris and Roth, Dan},
title = {Cross-lingual Models of Word Embeddings: An Empirical Comparison},
booktitle = {Proc. of ACL},
year = {2016},
}
Despite interest in using cross-lingual knowledge to learn word embeddings for
various tasks, a systematic comparison of the possible approaches is lacking in the
literature. We perform an extensive evaluation of four popular approaches of inducing
cross-lingual embeddings, each requiring a different form of supervision, on four
typologically different language pairs. Our evaluation setup spans four different
tasks, including intrinsic evaluation on monolingual and cross-lingual similarity,
and extrinsic evaluation on downstream semantic and syntactic applications. We
show that models which require expensive cross-lingual knowledge almost always
perform better, but cheaply supervised models often prove competitive on certain tasks.
@article{TACL730,
author = {Faruqui, Manaal and McDonald, Ryan and Soricut, Radu},
title = {Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning},
journal = {Transactions of the Association for Computational Linguistics},
volume = {4},
year = {2016},
issn = {2307-387X},
url = {https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/730},
pages = {1--16}
}
Morpho-syntactic lexicons provide information about the morphological and
syntactic roles of words in a language. Such lexicons are not available for all
languages and even when available, their coverage can be limited. We present a
graph-based semi-supervised learning method that uses the morphological,
syntactic and semantic relations between words to automatically construct wide
coverage lexicons from small seed sets. Our method is language-independent, and
we show that we can expand a 1000 word seed lexicon to more than 100 times its
size with high quality for 11 languages. In addition, the automatically created
lexicons provide features that improve performance in two downstream tasks:
morphological tagging and dependency parsing.
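The paper's propagation objective is not reproduced here; as a rough illustration of the general idea of spreading lexicon attributes over a word graph (graph, weights, and seed attributes below are hypothetical):

import numpy as np

def propagate_attributes(graph, seeds, n_iters=10):
    # graph: dict node -> list of (neighbor, weight)
    # seeds: dict node -> np.ndarray of attribute scores (the seed lexicon)
    # Unlabeled nodes repeatedly take the weighted average of their
    # neighbors' attribute vectors; seed nodes keep their gold attributes.
    dim = len(next(iter(seeds.values())))
    labels = {n: seeds.get(n, np.zeros(dim)) for n in graph}
    for _ in range(n_iters):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = seeds[node]
                continue
            total = sum(w for _, w in neighbors)
            if total == 0:
                updated[node] = labels[node]
            else:
                updated[node] = sum(w * labels[nb] for nb, w in neighbors) / total
        labels = updated
    return labels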
@inproceedings{faruqui:2016:infl,
author = {Faruqui, Manaal and Tsvetkov, Yulia and Neubig, Graham and Dyer, Chris},
title = {Morphological Inflection Generation Using Character Sequence to Sequence Learning},
booktitle = {Proc. of NAACL},
year = {2016},
}
Morphological inflection generation is the task of generating the inflected
form of a given lemma corresponding to a particular linguistic transformation.
We model the problem of inflection generation as a character sequence to
sequence learning problem and present a variant of the neural encoder-decoder
model for solving it. Our model is language independent and can be trained in
both supervised and semi-supervised settings. We evaluate our system on seven
datasets of morphologically rich languages and achieve results that are better
than or comparable to those of existing state-of-the-art models of inflection generation.
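As a small illustration of the input/output framing (the tag names are illustrative, not the datasets' actual schema; the encoder-decoder itself is not shown):

def to_seq2seq_example(lemma, tags, inflected=None):
    # Source side: the lemma's characters followed by morphological tag
    # tokens; target side: the inflected form's characters.
    source = list(lemma) + ["<" + t + ">" for t in tags]
    target = list(inflected) if inflected is not None else None
    return source, target

# e.g. to_seq2seq_example("Kalb", ["N", "PL", "NOM"], "Kälber") gives
# (['K','a','l','b','<N>','<PL>','<NOM>'], ['K','ä','l','b','e','r'])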
@InProceedings{tsvetkov:2015:eval,
author = {Tsvetkov, Yulia and Faruqui, Manaal and Ling, Wang and Lample, Guillaume and Dyer, Chris},
title = {Evaluation of Word Vector Representations by Subspace Alignment},
booktitle = {Proc. of EMNLP},
year = {2015}
}
Word vectors learned without supervision have proven to provide exceptionally
effective features in many NLP tasks. Most common intrinsic evaluations of
vector quality measure correlation with similarity judgments. However, these
often correlate poorly with how well the learned representations perform as
features in downstream evaluation tasks. We present QVEC--a computationally
inexpensive intrinsic evaluation measure of the quality of word embeddings
based on alignment to a matrix of features extracted from manually crafted
lexical resources--that obtains strong correlation with performance of the
vectors in a battery of downstream semantic evaluation tasks.
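A simplified, greedy variant of the idea (the paper defines the alignment between embedding dimensions and linguistic feature columns more carefully; this is only a sketch):

import numpy as np

def qvec_like_score(X, S):
    # X: (n_words, d) word embedding matrix; S: (n_words, k) matrix of
    # linguistic features for the same words. Each embedding dimension is
    # greedily matched to the linguistic column it correlates with most,
    # and the matched correlations are summed.
    score = 0.0
    for j in range(X.shape[1]):
        corrs = [np.corrcoef(X[:, j], S[:, k])[0, 1] for k in range(S.shape[1])]
        score += max(corrs)
    return score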
@InProceedings{faruqui:2015:sparse,
author = {Faruqui, Manaal and Tsvetkov, Yulia and Yogatama, Dani and Dyer, Chris and Smith, Noah A.},
title = {Sparse Overcomplete Word Vector Representations},
booktitle = {Proc. of ACL},
year = {2015},
}
Current distributed representations of words show little resemblance to theories
of lexical semantics. The former are dense and uninterpretable; the latter are
largely based on familiar, discrete classes (e.g., supersenses) and relations
(e.g., synonymy and hypernymy). We propose methods that transform word vectors
into sparse (and optionally binary) vectors. The resulting representations are
more similar to the interpretable features typically used in NLP, though they
are discovered automatically from raw corpora. Because the vectors are highly
sparse, they are computationally easy to work with. Most importantly, we find
that they outperform the original vectors on benchmark tasks.
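As a rough stand-in for the paper's own optimizer, generic sparse coding (here via scikit-learn, with placeholder data) illustrates the kind of transformation involved:

import numpy as np
from sklearn.decomposition import DictionaryLearning

# Placeholder data: 200 "word vectors" of dimension 50 stand in for real
# embeddings; the paper's optimizer and settings differ from this sketch.
X = np.random.randn(200, 50)

# Learn an overcomplete dictionary (more atoms than input dimensions) with
# L1-regularized codes, giving sparse representations of the input vectors.
dl = DictionaryLearning(n_components=100, alpha=1.0, max_iter=20, random_state=0)
A = dl.fit_transform(X)          # (200, 100) sparse, overcomplete codes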
@InProceedings{faruqui:2015:non-dist,
author = {Faruqui, Manaal and Dyer, Chris},
title = {Non-distributional Word Vector Representations},
booktitle = {Proc. of ACL},
year = {2015},
}
Data-driven representation learning for words is a technique of central
importance in NLP. While indisputably useful as a source of features in
downstream tasks, such vectors tend to consist of uninterpretable components
whose relationship to the categories of traditional lexical semantic theories
is tenuous at best. We present a method for constructing interpretable word
vectors from hand-crafted linguistic resources such as WordNet and FrameNet.
These vectors are binary (i.e., they contain only 0s and 1s) and are 99.9% sparse. We
analyze their performance on state-of-the-art evaluation methods for
distributional models of word vectors and find that they are competitive with
standard distributional approaches.
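A minimal sketch of the construction, using a hypothetical dictionary of lexicon-derived features in place of the real WordNet/FrameNet extraction:

def binary_vectors(word_features):
    # word_features: dict word -> set of lexicon-derived feature names
    # (the real pipeline extracts these from WordNet, FrameNet, etc.,
    # which is not shown here).
    feature_list = sorted({f for feats in word_features.values() for f in feats})
    index = {f: i for i, f in enumerate(feature_list)}
    vectors = {}
    for word, feats in word_features.items():
        vec = [0] * len(feature_list)
        for f in feats:
            vec[index[f]] = 1
        vectors[word] = vec
    return vectors

# e.g. binary_vectors({"dog": {"supersense:noun.animal", "pos:noun"},
#                      "run": {"supersense:verb.motion", "pos:verb"}})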
@InProceedings{faruqui:2015:relation,
author = {Faruqui, Manaal and Kumar, Shankar},
title = {Multilingual Open Relation Extraction Using Cross-lingual Projection},
booktitle = {Proc. of NAACL},
year = {2015}
}
Open domain relation extraction systems identify relation and argument phrases
in a sentence without relying on any underlying schema. However, current
state-of-the-art relation extraction systems are available only for English
because of their heavy reliance on linguistic tools such as part-of-speech
taggers and dependency parsers. We present a cross-lingual annotation
projection method for language independent relation extraction. We evaluate our
method on a manually annotated test set and present results on three
typologically different languages. We release these manual annotations and
extracted relations in 61 languages from Wikipedia.
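A rough sketch of the projection step under a simplified one-to-one word alignment (indices, structure, and the English extraction step are hypothetical or omitted):

def project_relation(relation, alignment):
    # relation: dict with "arg1", "rel", "arg2" mapped to lists of English
    # token indices; alignment: dict English token index -> target token
    # index. Unaligned tokens are dropped.
    return {part: sorted(alignment[i] for i in indices if i in alignment)
            for part, indices in relation.items()}

# e.g. project_relation({"arg1": [0], "rel": [1, 2], "arg2": [4]},
#                       {0: 0, 1: 3, 2: 4, 4: 2})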
@InProceedings{faruqui:2015:Retro,
author = {Faruqui, Manaal and Dodge, Jesse and Jauhar, Sujay K. and Dyer, Chris and Hovy, Eduard and Smith, Noah A.},
title = {Retrofitting Word Vectors to Semantic Lexicons},
booktitle = {Proc. of NAACL},
year = {2015}
}
Vector space word representations are learned from distributional information
of words in large corpora. Although such statistics are semantically
informative, they disregard the valuable information that is contained in
semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database. This
paper proposes a method for refining vector space representations using
relational information from semantic lexicons by encouraging linked words to
have similar vector representations, and it makes no assumptions about how the
input vectors were constructed. Evaluated on a battery of standard lexical
semantic evaluation tasks in several languages, we obtain substantial
improvements starting with a variety of word vector models. Our refinement
method outperforms prior techniques for incorporating semantic lexicons into
word vector training algorithms.
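A minimal sketch of the iterative update, assuming uniform weights (the paper's weighting scheme and lexicons are richer than this; vectors and lexicon below are placeholders):

import numpy as np

def retrofit(vectors, lexicon, n_iters=10, alpha=1.0, beta=1.0):
    # vectors: dict word -> np.ndarray (original distributional vectors)
    # lexicon: dict word -> list of linked words (e.g. synonyms)
    # Each word's vector is repeatedly pulled toward its lexicon neighbors
    # while staying close to its original vector; uniform alpha/beta
    # weights are a simplification.
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(n_iters):
        for word, neighbors in lexicon.items():
            nbrs = [n for n in neighbors if n in new_vecs]
            if word not in vectors or not nbrs:
                continue
            numerator = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = numerator / (alpha + beta * len(nbrs))
    return new_vecs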
@InProceedings{faruqui-2014:SystemDemo,
author = {Faruqui, Manaal and Dyer, Chris},
title = {Community Evaluation and Exchange of Word Vectors at wordvectors.org},
booktitle = {Proc. of ACL: System Demonstrations},
year = {2014},
}
Vector space word representations are useful for many natural language
processing applications. The diversity of techniques for computing vector
representations and the large number of evaluation benchmarks make reliable
comparison a tedious task both for researchers developing new vector space
models and for those wishing to use them. We present a website and suite of
offline tools that facilitate evaluation of word vectors on standard
lexical semantics benchmarks and permit exchange and archival by users who wish
to find good vectors for their applications. The system is accessible at:
www.wordvectors.org
@InProceedings{faruqui-dyer:2014:EACL,
author = {Faruqui, Manaal and Dyer, Chris},
title = {Improving Vector Space Word Representations Using Multilingual Correlation},
booktitle = {Proc. of EACL},
year = {2014},
}
The distributional hypothesis of Harris (1954), according to which the meaning
of words is evidenced by the contexts they occur in, has motivated several
effective techniques for obtaining vector space semantic representations of
words using unannotated text corpora. This paper argues that lexico-semantic
content should additionally be invariant across languages and proposes a simple
technique based on canonical correlation analysis (CCA) for incorporating
multilingual evidence into vectors generated monolingually. We evaluate the
resulting word representations on standard lexical semantic evaluation tasks
and show that our method produces substantially better semantic representations
than monolingual techniques.
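A minimal sketch of the CCA step on dictionary-aligned vectors (scikit-learn's CCA is used as a stand-in for the paper's implementation; data and dimensions below are placeholders):

import numpy as np
from sklearn.cross_decomposition import CCA

# Rows of X_en and X_fr are the vectors of translation pairs from a
# bilingual dictionary; random data stands in for real embeddings here.
X_en = np.random.randn(500, 100)
X_fr = np.random.randn(500, 100)

# CCA finds projections of the two spaces that are maximally correlated;
# the projected English vectors can then replace the originals. The
# projection dimensionality and any rescaling are choices this sketch does not fix.
cca = CCA(n_components=50)
cca.fit(X_en, X_fr)
X_en_proj, X_fr_proj = cca.transform(X_en, X_fr)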
@InProceedings{faruqui-dyer:2013:Short,
author = {Faruqui, Manaal and Dyer, Chris},
title = {An Information Theoretic Approach to Bilingual Word Clustering},
booktitle = {Proc. of ACL},
year = {2013},
}
We present an information theoretic objective for bilingual word clustering
that incorporates both monolingual distributional evidence and
cross-lingual evidence from parallel corpora to learn high quality word
clusters jointly in any number of languages. The monolingual component of our
objective is the average mutual information of clusters of adjacent words in
each language, while the bilingual component is the average mutual information
of the aligned clusters. To evaluate our method, we use the word clusters in an
NER system and demonstrate a statistically significant improvement in F1 score
when using bilingual word clusters instead of monolingual clusters.
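Paraphrasing the abstract in notation (not the paper's exact formulation), with cluster assignments C and C' in the two languages and a weight \beta on the cross-lingual term, the objective is roughly

\max_{C, C'} \; I\big(C(w_i), C(w_{i+1})\big) + I\big(C'(w'_j), C'(w'_{j+1})\big) + \beta \, I\big(C(w), C'(w')\big),

where the first two terms are the average mutual information between clusters of adjacent words in each language and the last term is the average mutual information between clusters of word-aligned pairs.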
@InProceedings{faruqui-pado:2012:EACL2012,
author = {Faruqui, Manaal and Pado, Sebastian},
title = {Towards a model of formal and informal address in English},
booktitle = {Proc. of EACL},
year = {2012},
}
Informal and formal (“T/V”) address in dialogue is not distinguished overtly in
modern English, e.g., by pronoun choice as it is in many other languages such as
French (“tu”/“vous”). Our study investigates the status of the T/V distinction
in English literary texts. Our main findings are: (a) human raters can label
monolingual English utterances as T or V fairly well, given sufficient context;
(b) a bilingual corpus can be exploited to induce a supervised classifier for
T/V without human annotation. It assigns T/V at sentence level with up to 68%
accuracy, relying mainly on lexical features; (c) there is a marked asymmetry
between lexical features for formal speech (which are conventionalized and
therefore general) and informal speech (which are text-specific).
@inproceedings{faruqui2010training,
title={Training and Evaluating a German Named Entity Recognizer with Semantic Generalization},
author={Faruqui, Manaal and Pad{\'o}, Sebastian},
booktitle={Proc. of KONVENS 2010},
year={2010},
}
We present a freely available optimized Named Entity Recognizer (NER) for
German. It compensates for the small size of available German NER training
corpora with distributional generalization features trained on large unlabelled
corpora. We vary the size and source of the generalization corpus and find
improvements of 6% F1 score (in-domain) and 9% (out-of-domain) over simple
supervised training.