DEV Community

Condé Nast Italy


When Food meets AI: the Smart Recipe Project

Part 1: Cleaning and manipulating food data

Recipes, cooking videos, and food photos are everywhere on the web, which is now the largest archive of food-related content.
But what happens when this much data meets Artificial Intelligence? At the Smart Recipe Project, we answered that question by developing systems that interpret and extract information from food recipes.
Wondering how?

The project, step by step:

  1. using NLP techniques, we enriched the data by labeling entities and adding entity-specific information;
  2. using state-of-the-art ML and DL models, we developed services that automatically extract information from recipes;
  3. using Amazon Neptune, we built graph databases to store and navigate the relationships among the data.

But first... we collected and cleaned the data.

Data Extraction

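Using Python and its text manipulation libraries, we extracted the recipes from TSV databases. When a row has no full recipe text, the function rebuilds it from that row's ingredients and steps: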

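import pandas as pd

def data_extractor(df_content, df_ingredients, df_steps, start, dim):
   # Collect (row index, recipe text) pairs for rows start .. start+dim-1
   list_cell = []
   for n, cell in enumerate(df_content[start:start+dim]):
      if pd.notna(cell):
         list_cell.append((start+n, cell))
      else:
         # No full text: rebuild the recipe from its ingredients and steps.
         # Index with start+n so the row matches the one read from df_content.
         list_cell.append((start+n, df_ingredients[start+n] + '\n' + df_steps[start+n]))
   return list_cell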
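
As a rough illustration of this extraction step, the sketch below reads a tiny in-memory TSV with pandas and applies the same fallback idea: use the full recipe text when it is present, otherwise join the ingredients and the steps. The column names (`content`, `ingredients`, `steps`), the sample rows, and the `load_recipes` helper are assumptions made for the example, not the project's actual schema or code.

```python
import io

import pandas as pd

# Hypothetical TSV sample; the real column names and data are not given in the article.
SAMPLE_TSV = (
    "content\tingredients\tsteps\n"
    "Boil the pasta and add pesto.\tpasta, pesto\tBoil. Mix.\n"
    "\t2 eggs, butter\tWhisk the eggs. Fry in butter.\n"
)


def load_recipes(tsv_text):
    """Read TSV text and return (row_index, recipe_text) pairs."""
    df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
    recipes = []
    for idx, row in df.iterrows():
        if pd.notna(row["content"]):
            recipes.append((idx, row["content"]))
        else:
            # Fall back to ingredients + steps when the full text is missing.
            recipes.append((idx, row["ingredients"] + "\n" + row["steps"]))
    return recipes


recipes = load_recipes(SAMPLE_TSV)
```

The second sample row has an empty `content` field, which pandas reads as a missing value, so its recipe text is rebuilt from the other two columns.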

Data Cleaning

We then cleaned them with a few regular expressions:

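import re

def clean_recipe(recipe, regex_list):
   # Lowercase once, then apply every substitution in order to the running result
   recipe = recipe.lower()
   # Each entry starts with (pattern, replacement); any further fields are ignored here
   for regex1, regex2, *_ in regex_list:
      recipe = re.sub(regex1, regex2, recipe)
   return recipe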
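
To make the cleaning step more concrete, here is a minimal sketch of what a list of substitution rules could look like. The three rules and the `apply_rules` helper are hypothetical, chosen only to illustrate the idea; the article does not list the expressions the project actually used.

```python
import re

# Hypothetical (pattern, replacement) pairs; the project's actual rules are not listed.
REGEX_LIST = [
    (r"<[^>]+>", " "),   # replace leftover HTML tags with a space
    (r"[ \t]+", " "),    # collapse runs of spaces and tabs
    (r" ?\n ?", "\n"),   # trim spaces around line breaks
]


def apply_rules(text, rules):
    """Lowercase the text, then apply each substitution in order."""
    text = text.lower()
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


cleaned = apply_rules("<p>Boil  the PASTA</p>\n Add   pesto.", REGEX_LIST)
print(repr(cleaned))  # 'boil the pasta\nadd pesto.'
```

The order of the rules matters: tags are replaced before whitespace is collapsed, so the spaces they leave behind are cleaned up by the later rules.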

Data Preprocessing

Finally, we 1) tokenized and 2) POS-tagged the data with NLTK:

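import nltk
from nltk.tokenize import MWETokenizer

# The article does not list its multiword expressions, so none are registered here;
# add them with mwe_tokenizer.add_mwe(...)
mwe_tokenizer = MWETokenizer()

def tokenize(recipe):
   # Split into sentences, then into words (word_tokenize is assumed for this step),
   # merging any registered multiword expressions
   sentences = nltk.sent_tokenize(recipe, language="english")
   return [mwe_tokenizer.tokenize(nltk.word_tokenize(sentence, language="english"))
           for sentence in sentences]

def pos_tagger(recipe, regex_list):
   # regex_list is passed in explicitly instead of being read from a global
   tokenized_text = tokenize(clean_recipe(recipe, regex_list))
   # One "token<TAB>tag" string per token, grouped by sentence
   tagged_tokens = [[str(tag_token[0]).lower() + "\t" + str(tag_token[1])
                     for tag_token in nltk.pos_tag(tokens)]
                    for tokens in tokenized_text]
   return tagged_tokens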
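
A multiword-expression tokenizer is useful for recipes because ingredients such as "olive oil" are better treated as a single token. The sketch below shows how NLTK's `MWETokenizer` merges such expressions in an already tokenized sentence. The expressions and the sample tokens are assumptions for illustration; how the project configured its tokenizer is not shown in the article.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical multiword ingredients; the article does not list the ones it used.
mwe_tokenizer = MWETokenizer([("olive", "oil"), ("baking", "soda")], separator=" ")

# Tokens are pre-split here so the example needs no downloaded NLTK models.
tokens = ["add", "olive", "oil", "and", "baking", "soda"]
merged = mwe_tokenizer.tokenize(tokens)
print(merged)  # ['add', 'olive oil', 'and', 'baking soda']
```

Merging happens before POS tagging, so each multiword ingredient would receive a single tag rather than one tag per word.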

Curious about the output? Head over to Medium to read the complete article and find out more about the most appetizing stages of our work.


When Food meets AI: the Smart Recipe Project
a series of 6 amazing articles

Table of contents

Part 1: Cleaning and manipulating food data
Part 1: A smart method for tagging your datasets
Part 2: NER for all tastes: extracting information from cooking recipes
Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier
Part 3: FoodGraph: a graph database to connect recipes and food data
Part 3: FoodGraph: Loading data and Querying the graph with SPARQL
