Yummly recipe dataset – clustering Asian cuisines

Imagine you want to know whether Korean cuisine is more similar to Chinese or Japanese (or perhaps Thai?). And what makes it unique? Answering these questions and many more is easy when you have thousands of recipes from cuisines all around the world. This is the Yummly recipe dataset.

Each recipe comes as a list of ingredients and a label for the cuisine it belongs to. The ingredient names can be a bit messy: some people write "garlic" and others "garlic cloves". Are these the same ingredient or not?

Fortunately, this kind of problem can be ignored if we stick to the most common ingredients (sugar, soy sauce, …).

Clustering: which cuisines are similar?

I considered only the four most frequent ingredients of each cuisine: the four distinct ingredients that appear most frequently in Japanese cuisine, in French cuisine, in Mexican cuisine, and so on.

The same ingredient often shows up in the top four of several cuisines (for example, salt is the most used ingredient in all Western cuisines), so the combined list is not as long as it sounds.
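The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recipes are made-up toy data, the helper name `top_ingredients` is hypothetical, ties are broken alphabetically (the post does not say how ties were handled), and `k` is set to 2 instead of 4 to keep the example small.

```python
from collections import Counter, defaultdict

# Made-up toy recipes as (cuisine, ingredients) pairs; not taken from the dataset.
recipes = [
    ("japanese", ["soy sauce", "salt", "mirin", "sugar"]),
    ("japanese", ["soy sauce", "sugar", "rice"]),
    ("french", ["salt", "butter", "sugar"]),
    ("french", ["salt", "butter", "eggs"]),
    ("mexican", ["salt", "onions", "garlic"]),
    ("mexican", ["onions", "salt", "cumin"]),
]

def top_ingredients(recipes, k):
    """Return {cuisine: the k ingredients that appear in the most recipes}."""
    counts = defaultdict(Counter)
    for cuisine, ingredients in recipes:
        counts[cuisine].update(set(ingredients))  # count each recipe at most once
    top = {}
    for cuisine, counter in counts.items():
        # Sort by descending count, then by name so that ties are deterministic.
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        top[cuisine] = [name for name, _ in ranked[:k]]
    return top

top = top_ingredients(recipes, k=2)

# Union of the per-cuisine lists: shared ingredients are kept only once.
selected = sorted({name for names in top.values() for name in names})
print(selected)
```

With three cuisines and two ingredients each, the union has five entries rather than six, because salt is shared by two of the cuisines.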

For every ingredient I measured the fraction of recipes in which it appears, for a given cuisine. This results in a table where every row is a cuisine and every column is an ingredient.
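A minimal Python sketch of how such a table could be built follows. The recipes and the column list are made-up toy data and `usage_table` is a hypothetical helper name, so read it as an illustration of the idea and not as the code behind the figures.

```python
from collections import defaultdict

# Made-up toy recipes as (cuisine, ingredients) pairs; not taken from the dataset.
recipes = [
    ("japanese", ["soy sauce", "salt", "mirin", "sugar"]),
    ("japanese", ["soy sauce", "sugar", "rice"]),
    ("french", ["salt", "butter", "sugar"]),
    ("french", ["salt", "butter", "eggs"]),
    ("mexican", ["salt", "onions", "garlic"]),
    ("mexican", ["onions", "salt", "cumin"]),
]

# Hypothetical set of selected ingredients (the table columns).
columns = ["salt", "soy sauce", "sugar"]

def usage_table(recipes, columns):
    """Return {cuisine: [fraction of its recipes containing each column ingredient]}."""
    totals = defaultdict(int)                      # recipes per cuisine
    hits = defaultdict(lambda: defaultdict(int))   # recipes per cuisine and ingredient
    for cuisine, ingredients in recipes:
        totals[cuisine] += 1
        present = set(ingredients)
        for name in columns:
            if name in present:
                hits[cuisine][name] += 1
    return {
        cuisine: [hits[cuisine][name] / totals[cuisine] for name in columns]
        for cuisine in sorted(totals)
    }

table = usage_table(recipes, columns)
for cuisine, row in table.items():
    print(cuisine, row)
```

Each row is one cuisine and each position in the row corresponds to one ingredient in `columns`, which is the layout the clustering step expects.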

I then ran an agglomerative cluster analysis using agnes from the cluster package in R. Here is the result:
Dendrogram of world cuisines
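To make the clustering step concrete, here is a naive Python sketch of agglomerative clustering. It is not the R code used for the dendrogram. It assumes Euclidean distance and average linkage, which are the documented defaults of agnes (the post does not say which settings were used), and the usage fractions are made-up numbers.

```python
import math
from itertools import combinations

# Made-up usage fractions: rows are cuisines, columns are three hypothetical ingredients.
table = {
    "chinese":  [0.40, 0.60, 0.02],
    "japanese": [0.35, 0.55, 0.01],
    "french":   [0.70, 0.01, 0.20],
    "italian":  [0.65, 0.00, 0.45],
}

def average_linkage(table):
    """Naive agglomerative clustering; returns the merges as (a, b, height) in order."""
    clusters = [(name,) for name in sorted(table)]  # start with one cluster per cuisine
    merges = []

    def distance(a, b):
        # Average Euclidean distance over all pairs with one member in each cluster.
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(table[x], table[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((a, b, distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = average_linkage(table)
for a, b, height in merges:
    print(a, "+", b, f"merged at height {height:.3f}")
```

The merge heights are what a dendrogram draws on its vertical axis: cuisines that join low in the tree have similar ingredient profiles.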

There is a distinct East Asian cluster and two Western clusters: the continental European and the Mediterranean. Somewhere in between, Indian, Filipino, and Jamaican (!) cuisines end up together.

Maybe we will gain more insight by looking at the geographical clusters more closely.

Clustering East and South-East Asian cuisines

This time I took only the three most frequent ingredients of each cuisine. After removing duplicates, six ingredients remain, which is a good number to show on a spider plot.

For each cuisine, which fraction of the recipes uses a given ingredient? Here is what each cuisine looks like in ingredient space:

Note the immediate similarity of Thai and Vietnamese cuisine, and how Korean and Chinese resemble each other, but not as closely.

Here is the dendrogram (again, agglomerative clustering with agnes but only on these ingredients):

So, based exclusively on how often the most frequent ingredients are used (as recorded by international users of a US-based English-language website, and glossing over details such as cooking times and methods, ingredient quantities, how the ingredients are combined and prepared, regional differences in ingredient names, and more), Korean cuisine is most similar to Chinese. These two are more similar to Japanese cuisine than they are to the other Asian cuisines.

Mapping ingredients

A darker shade of green means that an ingredient is used more in a given area. Gray represents countries that do not appear in the dataset. Colors are assigned so that the darkest shade is always used: for each ingredient, the country usages are normalized to the maximum with a linear transformation.
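One way to read this normalization is a plain division by the largest value, sketched below in Python. This is an assumption: the text only says the scaling is linear and relative to the maximum, so a min-max rescaling would fit the description as well. The usage fractions are made up and `shade_intensities` is a hypothetical helper name.

```python
def shade_intensities(usage):
    """Linearly rescale usage fractions so that the largest one maps to 1.0."""
    peak = max(usage.values())
    if peak == 0:
        # No country uses the ingredient: nothing to shade.
        return {country: 0.0 for country in usage}
    return {country: value / peak for country, value in usage.items()}

# Made-up garlic usage fractions per country; not taken from the dataset.
garlic = {"philippines": 0.60, "korea": 0.57, "japan": 0.15}

shades = shade_intensities(garlic)
for country, intensity in shades.items():
    print(country, round(intensity, 2))
```

Under this reading, the country with the highest usage always gets the darkest shade, so the maps show relative usage within one ingredient and cannot be compared across ingredients.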

The Philippines lead in salt and garlic, with Korea a close second in the latter, while Japanese recipes use comparatively little garlic.

Soy sauce and fish sauce separate South-East Asia from the northern part of the continent. In China, Japan and Korea, fish sauce is practically never used, while in Thailand, Vietnam and the Philippines soy sauce is not so popular.

Korea is the regional leader in sesame oil, but loses the onion race to the Philippines.