When you're working with machine learning, deciding which algorithm performs best across multiple datasets can be quite challenging. Simply looking at performance metrics might not be enough; you need statistical methods to be confident that the differences are real. That's where the Friedman Test and Critical Difference (CD) Diagrams can help.
My classmates and I faced this challenge firsthand when preparing a project presentation. We struggled to find a clear way to generate the diagram, so after finally figuring it out, I decided to share this guide to save others time.
In this article, you'll find Python code that performs this evaluation and visualization. You can also access the complete code in my GitHub gist.
I'll also show you how to modify the code to use accuracy instead of the error rate. The script has been tested on Python 3.8 and above, and it relies on pandas, NumPy, Matplotlib, SciPy, and the aeon library.
The Python script does three main things:
- Performs the Friedman Test to statistically evaluate performance differences.
- Creates a ranking table comparing the algorithm scores.
- Generates and saves a PNG image of the Critical Difference Diagram and the ranking table.
Critical Difference Diagram generated
In the diagram, algorithms connected by a horizontal bar are not significantly different from each other according to the statistical test. Algorithms with lower (better) average ranks appear further to the right. The ranking remains the same whether you use error rate or accuracy as the performance metric.
Ranking table generated
This table shows the error rates of each algorithm across all datasets. Each cell contains the error rate along with the algorithm's rank on that dataset in parentheses (1 is the best; the worst possible rank equals the number of algorithms being compared).
At the bottom, you’ll find the rank sums and average rankings for each algorithm, for better overall comparison.
I edited the original image and removed some columns for better readability here.
Why Use the Friedman Test and Critical Difference Diagram?
The Friedman test is a non-parametric statistical test designed to detect differences between multiple algorithms across various datasets. It ranks algorithms based on their performance, helping you understand if differences in performance are genuinely significant or just due to chance.
The Critical Difference Diagram visually presents these rankings. It clearly shows which algorithms perform similarly and which are significantly better or worse, making it easy to interpret results at a glance. This diagram is particularly useful when comparing numerous algorithms across multiple datasets.
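To make this concrete, here is a minimal, self-contained sketch using made-up error rates for three algorithms ('A', 'B', 'C') on six datasets. The numbers are purely illustrative, and the snippet assumes scipy and aeon are installed; it uses the same plot_critical_difference function that the full script below relies on.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import friedmanchisquare
from aeon.visualisation import plot_critical_difference
# Illustrative error rates: rows = 6 datasets, columns = 3 algorithms (A, B, C)
toy_errors = np.array([
    [0.10, 0.15, 0.30],
    [0.08, 0.12, 0.25],
    [0.12, 0.11, 0.28],
    [0.09, 0.14, 0.31],
    [0.11, 0.13, 0.27],
    [0.07, 0.16, 0.29],
])
# Friedman test: pass one sample per algorithm (its errors across all datasets)
stat, p_value = friedmanchisquare(*toy_errors.T)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
# Critical difference diagram (lower error = better)
plot_critical_difference(toy_errors, ['A', 'B', 'C'], lower_better=True)
plt.show()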
Preparing Your Data
Now for the implementation, you'll need your data structured like this:
- Datasets: the names of your datasets (e.g., MNIST, Fashion-MNIST).
- Algorithms: the names of the algorithms you're evaluating. Keep these ordered consistently.
- Performance (Error): one list of error rates per dataset, with the values aligned with your Algorithms list.
For example:
data = {
'Datasets': ['MNIST', 'Fashion-MNIST', ...],
'Algorithms': ['NaiveBayes', 'IBk', ..., 'RandomForest'],
'Performance (Error)': [
['30.34%', '3.09%', ..., '88.65%'], # MNIST
['36.72%', '14.35%', ..., '90%'], # Fashion-MNIST
# Other datasets...
]
}
Make sure the error rates are listed in the same order as their corresponding algorithms. For example, if NaiveBayes is the first algorithm in the list, its performance values should always appear first in each dataset's row.
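If you want to be sure nothing is misaligned, a quick sanity check is to verify that every dataset row contains exactly one value per algorithm; a mismatched row would otherwise be silently truncated when the rows are paired with the algorithm names later on. A minimal check, assuming the data dictionary is structured as shown above:
# Check that each dataset row has exactly one value per algorithm
n_algorithms = len(data['Algorithms'])
for name, row in zip(data['Datasets'], data['Performance (Error)']):
    if len(row) != n_algorithms:
        print(f"Warning: '{name}' has {len(row)} values, expected {n_algorithms}")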
If you prefer to use accuracy instead of error rates, you can either replace the values in the Performance field with accuracy scores or simply subtract the error rates from 1. I'll also demonstrate how to do this right after the implementation.
Python Implementation
Here's the Python code, also available in this gist on GitHub:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import friedmanchisquare
from aeon.visualisation import plot_critical_difference
data = {
'Datasets': [
'MNIST', 'Fashion-MNIST',
'e1-Spiral', 'e1-Android',
'e2-andorinhas', 'e2-chinese',
'e3-user', 'e3-ecommerce',
'e4-wine', 'e4-heart',
'e5-mamiferos', 'e5-titanic'
],
'Algorithms': [
'NaiveBayes', 'IBk', 'J48', 'RandomForest', 'LMT',
'XGBoost', 'SVM', 'LGBM', 'Bagging', 'AdaBoost',
'KStar', 'M5P', 'MLP', 'HC', 'E-M'
],
'Performance (Error)': [
[ # MNIST -ok
'30.34%', '3.09%', '10.67%', '3.51%', '5.70%',
'2.05%', '2.61%', '2.26%', '5.07%', '11.74%',
'89.80%', '47.31%', '0%', '44.96%', '88.65%'
],
[ # Fashion-MNIST -ok
'36.72%', '14.35%', '18.27%', '11.92%', '13.62%',
'8.58%', '9.47%', '9.50%', '12.20%', '19.00%',
'90%', '46.78%', '0.52%', '51.45%', '90%'
],
[ # e1-Spiral -ok
'29.125%', '0.38%', '2.25%', '1.75%', '2.375%',
'3.12%', '1.88%', '3.12%', '0%', '4.37%',
'5.51%', '1.43%', '0%', '49.75%', '72.50%'
],
[ # e1-Android -ok
'8.1317%', '7.7285%', '4.7491%', '4.5475%', '4.3683%',
'4.03%', '4.37%', '3.47%', '6.38%', '5.60%',
'8.94%', '7.61%', '1.67%', '49.98%', '38.95%'
],
[ # e2-andorinhas
'8.1%', '6.65%', '5.60%', '4.90%', '4.60%',
'4.00%', '3.75%', '4.25%', '3.50%', '4.75%',
'4.36%', '2.85%', '3.92%', '48.60%', '49.25%'
],
[ # e2-chinese -ok
'27.1589%', '12.8911%', '34.9186%', '7.5094%', '10.6383%',
'7.50%', '6.25%', '5.63%', '16.25%', '34.38%',
'0%', '1.21%', '0%', '87.36%', '78.22%'
],
[ # e3-user
'0%', '4.8571%', '0%', '0%', '0%', '0.1429%',
'2.14%', '0%', '0%', '0%', '0%', '0%',
'0%', '0.39%', '0%', '79.14%', '4.57%'
],
[ # e3-ecommerce -ok
'11.37%', '11.15%', '2.39%', '2.07%', '2.42%',
'0.90%', '8.80%', '0.70%', '10.35%', '2.85%',
'0.02%', '7.56%', '3.96%', '22.11%', '41.49%'
],
[ # e4-wine -ok
'44.96%', '35.21%', '38.59%', '29.89%', '39.65%',
'48.95%', '56.56%', '46.85%', '43.94%', '50.99%',
'39.23%', '50.82%', '36.51%', '57.34%', '77.98%'
],
[ # e4-heart -ok
'43.51%', '46.61%', '35.82%', '37.20%', '35.88%',
'45.71%', '34.51%', '45.73%', '44.16%', '46.1%',
'46.15%', '64.18%', '49.22%', '88.1962%', '69.94%'
],
[ # e5-mamiferos -ok
'0%', '0%', '0%', '0%', '0%', '0%',
'0%', '0%', '0%', '0%', '0%', '0%',
'1.57%', '0%', '0.20%', '31.20%', '44.80%'
],
[ # e5-titanic -ok
'21.3244%', '22.7834%', '22.5589%', '28.3951%', '19.7531%',
'16.76%', '27.56%', '14.85%', '7.48%', '10.79%',
'27.18%', '61.62%', '26.76%', '38.16%', '38.50%'
]
]
}
# Convert the data into a DataFrame
datasets = data['Datasets']
algorithms = data['Algorithms']
performance_data = data['Performance (Error)']
# Create a list of dictionaries for each dataset
rows = []
for dataset, performance in zip(datasets, performance_data):
    row = {'Dataset': dataset}
    row.update({alg: perf for alg, perf in zip(algorithms, performance)})
    rows.append(row)
# Create the DataFrame
df = pd.DataFrame(rows)
# Convert string percentages to floats
for alg in algorithms:
    df[alg] = df[alg].str.replace(',', '.').str.rstrip('%').astype(float) / 100
# Calculate the ranking of each algorithm for each dataset
rankings_matrix = df[algorithms].rank(axis=1, method='min', ascending=True)
# Format the results
formatted_results = df[algorithms].copy()
for col in formatted_results.columns:
    formatted_results[col] = formatted_results[col].round(3).astype(str) + " (" + rankings_matrix[col].astype(int).astype(str) + ")"
# Add a row for the sum of ranks and average of ranks
sum_ranks = rankings_matrix.sum().round(3).rename('Sum Ranks')
average_ranks = rankings_matrix.mean().round(3).rename('Average Ranks')
# Add the rows to the formatted DataFrame using concat
formatted_results = pd.concat([formatted_results, sum_ranks.to_frame().T, average_ranks.to_frame().T])
# Add the 'Dataset' column to the formatted DataFrame
formatted_results.insert(0, 'Dataset', df['Dataset'].tolist() + ['Sum Ranks', 'Average Ranks'])
# Display the table
print("Error Table (%) with Ranking:")
print(formatted_results)
# Save the formatted table as an image
fig, ax = plt.subplots(figsize=(14, 8))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=formatted_results.values, colLabels=formatted_results.columns, cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(2.5, 2.5)
plt.subplots_adjust(left=0.2, bottom=0.2, right=0.8, top=1, wspace=0.2, hspace=0.2)
plt.savefig('table_with_rankings.png', format="png", bbox_inches="tight", dpi=300)
plt.show()
print("Table saved as 'table_with_rankings.png'")
# Perform the Friedman Test
friedman_stat, p_value = friedmanchisquare(*rankings_matrix.T.values)
print(f"Friedman test statistic: {friedman_stat}, p-value = {p_value}")
# Convert the error-rate matrix into a NumPy array for the critical difference diagram
scores = df[algorithms].values
classifiers = df[algorithms].columns.tolist()
print("Algorithms:", classifiers)
print("Errors:", scores)
# Set the figure size before plotting
plt.figure(figsize=(16, 12)) # Adjust the figure size as needed
# Generate the critical difference diagram
plot_critical_difference(
    scores,
    classifiers,
    lower_better=True,
    test='wilcoxon',  # or nemenyi
    correction='holm',  # or bonferroni or none
)
# Get the current axes
ax = plt.gca()
# Adjust font size and rotation of x-axis labels
for label in ax.get_xticklabels():
    label.set_fontsize(14)
    label.set_rotation(45)
    label.set_horizontalalignment('right')
# Increase padding between labels and axis
ax.tick_params(axis='x', which='major', pad=20)
# Adjust margins to provide more space for labels
plt.subplots_adjust(bottom=0.35)
# Optionally adjust y-axis label font size
ax.tick_params(axis='y', labelsize=12)
# Save and display the plot
plt.savefig('critical_difference_diagram.png', format="png", bbox_inches="tight", dpi=300)
plt.show()
Changing to accuracy instead of error
The code gist already includes a full working script that uses accuracy as the performance metric.
If you're already using error percentages in the data variable, you can add one last line inside the conversion loop to turn the values into accuracies:
# Convert string percentages to floats
for alg in algorithms:
    df[alg] = df[alg].str.replace(",", ".").str.rstrip("%").astype(float) / 100
    # Convert to accuracy
    df[alg] = 1 - df[alg]
Then change the rankings_matrix calculation so that ascending is set to False:
# Calculate the ranking of each algorithm for each dataset (higher accuracy = better rank)
rankings_matrix = df[algorithms].rank(axis=1, method="min", ascending=False)
Finally, when plotting the Critical Difference diagram, change lower_better to False:
plot_critical_difference(
    scores,
    classifiers,
    lower_better=False,  # False for accuracy (higher is better)
    test="wilcoxon",
    correction="holm",
)
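As a quick consistency check on the earlier claim that the ranking is the same for error rate and accuracy: if you keep both versions of the converted DataFrame around (here called df_err and df_acc, hypothetical names built exactly like df above), the two ranking matrices should be identical.
# Ranks from error (lower is better) vs. ranks from accuracy (higher is better)
ranks_err = df_err[algorithms].rank(axis=1, method="min", ascending=True)
ranks_acc = df_acc[algorithms].rank(axis=1, method="min", ascending=False)
print(ranks_err.equals(ranks_acc))  # expected to print True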
Conclusion
I hope this guide helps someone out :) If you have any suggestions or questions, feel free to leave a comment or reach out, and I'll do my best to get back to you!