COMP3931 Individual Project · University of Leeds · 2025/26

AI Misuse Evaluator

A free, open tool that scores any AI system out of 100 – so anyone can understand whether an AI tool is truly justified, regardless of their technical background.

What is this tool and why does it exist?

Artificial intelligence (AI) is now used everywhere – in hospitals, schools, social media, policing, and creative industries. But most people have no way of knowing whether a particular AI tool is actually good for society, or whether it causes harm. Experts disagree, the science is complex, and companies rarely publish honest assessments of their own products.

This tool solves that problem by applying a scientific scoring method to 40 real AI systems. It measures five things that matter – environmental impact, ethical risk, creative displacement, purpose, and transparency – and combines them into a single Justifiability Score from 0 to 100, along with an A–F grade (just like an EU energy rating on a fridge). The higher the score, the more justified the AI system is. The lower the score, the more reasons there are to question whether it should exist.

You don't need to know anything about AI or data science to use this tool. Everything is explained in plain language.

How does the scoring work?

We collect data about an AI system across five dimensions, such as its carbon footprint and whether it has been used harmfully.

Each dimension is converted to a 0–100 score using a formula. Higher always means better (lower emissions, lower risk, higher transparency).

The five scores are combined using a weighted average. Environmental impact counts the most (35%), followed by ethics (25%), then transparency and creativity (15% each), then purpose (10%).

A final score from 0–100 is produced and converted to a grade: A (≥80) down to F (<20). You can also change the weights to reflect what matters most to you.
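The weighted average in the third step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the dictionary keys and the function name `justifiability_score` are made up for this example, and the component scores are invented rather than taken from the dataset. Only the default weights come from the text above.

```python
# Default weights as stated above (they add up to 1.00).
DEFAULT_WEIGHTS = {
    "emissions": 0.35,
    "ethics": 0.25,
    "transparency": 0.15,
    "creativity": 0.15,
    "purpose": 0.10,
}


def justifiability_score(component_scores, weights=DEFAULT_WEIGHTS):
    """Combine five 0-100 component scores into one 0-100 weighted average."""
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical component scores, for illustration only.
example = {
    "emissions": 90,
    "ethics": 80,
    "transparency": 60,
    "creativity": 70,
    "purpose": 100,
}

# 0.35*90 + 0.25*80 + 0.15*60 + 0.15*70 + 0.10*100 = 81
print(round(justifiability_score(example), 1))
```

Because every component is already on a 0–100 scale and the weights add up to 1.00, the result also stays between 0 and 100.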

The five dimensions we measure:

🌱

Emissions

How much CO₂ does running this AI system produce? Training a large AI model can emit as much carbon as five cars over their entire lifetime.

⚖️

Ethical Risk

Could this AI cause harm – bias, discrimination, privacy violations, surveillance, or being used as a weapon? This is the strongest predictor of whether an AI is justifiable.

🎨

Creative Displacement

Does this AI replace human artists, writers, musicians, or other creative professionals, potentially destroying livelihoods?

🎯

Purpose

Is there a genuine, important reason for this AI to exist? Saving lives in a hospital is essential. Generating spam emails is not.

🔍

Transparency

Can you understand how the AI made its decision? Can you challenge it? Opaque "black box" systems that no one can explain or audit are a serious concern.

What do the grades mean?

A: Score ≥ 80 – Strongly justifiable. Clear benefit, low harm, transparent, responsible.

B: Score 65–79 – Well justified. Minor concerns but overall positive.

C: Score 50–64 – Borderline. Some real concerns that need addressing.

D: Score 35–49 – Poorly justified. Significant problems outweigh the benefits.

E: Score 20–34 – Very hard to justify. Serious harms, minimal benefit.

F: Score < 20 – Cannot be justified. This AI system causes more harm than good.
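The grade bands can be expressed as a simple lookup. The sketch below assumes each band's lower bound is inclusive, so a fractional score such as 79.9 still falls in band B. How the tool itself rounds scores is not stated, and the function name `grade` is a label chosen for this example.

```python
# Lower bound of each band, highest first, taken from the grade list above.
GRADE_THRESHOLDS = [(80, "A"), (65, "B"), (50, "C"), (35, "D"), (20, "E")]


def grade(score):
    """Return the A-F grade for a 0-100 Justifiability Score."""
    for lower_bound, letter in GRADE_THRESHOLDS:
        if score >= lower_bound:
            return letter
    return "F"  # anything below 20


for s in (92, 79.9, 50, 34, 12):
    print(s, grade(s))
```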

Why these weights? The decision to weight environmental impact most heavily (35%) reflects a growing body of research showing that AI's carbon footprint is a severe and underreported problem (Strubell et al., 2019). Ethics is weighted second (25%) because our analysis found it to be the strongest single predictor of justifiability. Transparency (15%) reflects the EU AI Act's legal requirements. All weights are user-adjustable – you are not forced to agree with our defaults.

One important rule – the guardrail. If an AI system is classified as "harmful" in purpose, no amount of high scores on other dimensions can fully compensate. A penalty multiplier is applied (×0.60, or ×0.42 for the most harmful cases). This reflects the principle that you cannot justify a weapon by saying it runs on green energy.
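A rough sketch of the guardrail follows. It assumes the multiplier is applied to the final weighted score, and it uses a hypothetical `most_harmful` flag because the rule for deciding which cases count as "most harmful" is not described here. Only the two multipliers come from the text.

```python
HARMFUL_MULTIPLIER = 0.60       # stated penalty for "harmful" purpose
MOST_HARMFUL_MULTIPLIER = 0.42  # stated penalty for the most harmful cases


def apply_guardrail(score, purpose, most_harmful=False):
    """Scale down the weighted score when the purpose category is 'harmful'.

    `most_harmful` is a hypothetical flag; how the tool selects the
    stricter multiplier is an assumption of this sketch.
    """
    if purpose != "harmful":
        return score
    multiplier = MOST_HARMFUL_MULTIPLIER if most_harmful else HARMFUL_MULTIPLIER
    return score * multiplier


print(round(apply_guardrail(85, "beneficial"), 1))                 # unchanged
print(round(apply_guardrail(85, "harmful"), 1))                    # 85 * 0.60
print(round(apply_guardrail(85, "harmful", most_harmful=True), 1)) # 85 * 0.42
```

Under these assumptions, even a harmful system with a perfect weighted score of 100 ends up at 60 at most, which is grade C.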

📊 What is this tab?

This tab lets you explore our database of 40 real AI use cases that we have already evaluated – from life-saving medical tools to harmful surveillance systems. Select any use case from the dropdown to see its full score breakdown and understand exactly why it received that score.

You can also adjust the weights below to change how much each dimension counts. For example, if you personally care more about ethics than emissions, slide the Ethics weight up and see how the scores change.

Adjust Weights

Think of weights like marks in an exam – they decide how much each subject counts towards your final grade. The five sliders must add up to 1.00 (100%). Our defaults are based on academic research, but you can change them to reflect your own values.
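The "must add up to 1.00" rule can be checked as shown below. This is an illustrative sketch: the function names are invented, and `normalise_weights` is one possible way to repair a set of slider values, not necessarily what the tool does.

```python
import math


def validate_weights(weights):
    """Raise ValueError unless the weights are non-negative and total 1.00."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    # Allow for tiny floating-point error when adding decimals.
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"weights add up to {total:.2f}, expected 1.00")
    return weights


def normalise_weights(weights):
    """Rescale any positive weights so that they add up to 1.00."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


sliders = {"emissions": 0.35, "ethics": 0.50, "transparency": 0.15,
           "creativity": 0.15, "purpose": 0.10}  # totals 1.25, so invalid
fixed = normalise_weights(sliders)
print(round(sum(fixed.values()), 2))
```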

Total: 1.00

Select a Use Case

Try: Accessibility Captioning, Predictive Policing, Deepfake Voice Tool, Code Assistant, High-Emission Chatbot

What do these five component scores mean?

Below you can see the score for each of the five dimensions we measure. Each score is out of 100 – higher is always better (meaning lower risk, lower emissions, or higher transparency). The final Justifiability Score is a weighted combination of all five.

Plain English Analysis

⚡ What is this tab?

Do you use an AI tool and want to know whether it is truly justified? Enter its details below and we will calculate a Justifiability Score for it – using exactly the same scientific method we used for our 40 pre-evaluated use cases.

You don't need to be a technical expert. Each field below is explained so you know exactly what information to provide. If you are unsure about a value, use your best estimate – the tool is designed to be accessible to everyone.

Enter Your AI Tool Details

AI Tool Name

Give the AI system a name so you can identify it in the results. Use underscores instead of spaces (e.g. My_Chatbot).

🌱 Emissions (kgCO₂e per use)

How much CO₂ does one use of this AI produce? Not sure? A typical ChatGPT query ≈ 0.001–0.01 kg. A Google search ≈ 0.0002 kg. A large image generation ≈ 0.03 kg. Training a large language model ≈ 500,000 kg (a one-off cost rather than a per-use figure). If you don't know, leave the default (0.008 kg, an average chatbot query).

🎯 Purpose Category

Essential – Saves lives, critical infrastructure (e.g. cancer detection, earthquake warning).
Beneficial – Genuinely useful, improves lives (e.g. accessibility tools, language translation).
Low Benefit – Marginal value, mainly convenience or entertainment (e.g. novelty chatbot, AI horoscopes).
Harmful – Designed to deceive, surveil, weaponise or cause harm (e.g. deepfakes, autonomous weapons). Triggers an automatic score penalty.

⚖️ Ethical Risk

1 = No ethical concerns · 5 = Severe ethical risk

How much potential does this AI have to cause harm? Rate 1 if it is safe and neutral (e.g. a weather forecaster). Rate 5 if it could cause serious harm – discrimination, surveillance, privacy violations, or use as a weapon. Our analysis found this dimension to be the strongest single predictor of justifiability.

🎨 Creative Displacement

1 = Does not displace anyone · 5 = Replaces human creatives entirely

Does this AI replace human artists, writers, musicians, designers, or other creative workers? Rate 1 if it helps humans do their jobs better (augmentation). Rate 5 if it fully replaces human creative work, destroying livelihoods.

🔍 Transparency & Explainability

1 = Completely opaque / black box · 5 = Fully explainable

Can the people affected by this AI understand how it reached its decision? Rate 1 if it is a "black box" – no one can explain its reasoning (e.g. a social media algorithm that bans accounts without explanation). Rate 5 if every decision comes with a clear, checkable explanation (e.g. a loan rejection that lists exactly why you were declined).
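The formula that turns a 1–5 rating into a 0–100 component score is not given on this page. Purely as an assumption, the sketch below uses a straight-line mapping and flips the two "risk" scales, so that higher always means better as described earlier. The function name and the `higher_is_better` parameter are hypothetical.

```python
def rating_to_score(rating, higher_is_better):
    """Map a 1-5 slider rating to 0-100 (assumed linear; not the tool's confirmed formula)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    fraction = (rating - 1) / 4  # 1 -> 0.0, 5 -> 1.0
    return 100 * fraction if higher_is_better else 100 * (1 - fraction)


# Ethical Risk and Creative Displacement: 1 is best, so the scale is flipped.
print(rating_to_score(1, higher_is_better=False))  # best possible risk rating
print(rating_to_score(5, higher_is_better=False))  # worst possible risk rating

# Transparency: 5 is best, so the scale is kept as-is.
print(rating_to_score(3, higher_is_better=True))   # midpoint
```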

Dimension Weights

Optional: adjust how much each dimension counts towards the final score. All five sliders must add up to 1.00. If you are unsure, leave these as they are – the defaults are based on published research.

Total: 1.00

What do these five component scores mean?

Each score is out of 100 – higher is always better. The final Justifiability Score is a weighted combination of all five, using the weights you set above. Click the button below each section to understand the science and data behind each score.

Plain English Analysis

Percentile Rank in Dataset (scale: 0 to 100)
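A percentile rank can be computed in more than one way. The sketch below assumes the simplest definition: the percentage of dataset scores that are strictly lower than your tool's score. The dataset values are invented for illustration and are not the 40 real scores.

```python
def percentile_rank(score, dataset_scores):
    """Percentage of dataset scores strictly below `score` (assumed definition)."""
    if not dataset_scores:
        raise ValueError("dataset must not be empty")
    below = sum(1 for s in dataset_scores if s < score)
    return 100 * below / len(dataset_scores)


# Hypothetical mini-dataset of five scores.
dataset = [20, 40, 60, 80, 90]
print(percentile_rank(70, dataset))  # 3 of 5 scores are lower
```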

📈 What is this tab?

This table shows all 40 AI systems ranked from most to least justifiable using our scoring framework. Use it to compare systems side by side and see at a glance which AI tools are well justified and which are not.

Try changing the weights below to see how the rankings shift depending on what you care about most. For example, push the Ethics slider to 0.50 and watch which use cases rise and fall in the ranking – it reveals how much the result depends on your value priorities.

Click the 🏷️ Label button on any row for a full breakdown of why that AI system received its score.

Adjust Weights to Re-rank

Move any slider and the entire table re-ranks instantly. This demonstrates that our framework is transparent – you can see exactly how and why each AI system's rank changes based on different priorities.
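The re-ranking idea can be illustrated with two made-up systems. Everything in this sketch is hypothetical (the system names, their component scores, and the ethics-heavy weight set), and the guardrail penalty is left out for brevity. Only the default weights come from this page.

```python
DEFAULT_WEIGHTS = {"emissions": 0.35, "ethics": 0.25, "transparency": 0.15,
                   "creativity": 0.15, "purpose": 0.10}

# A hypothetical alternative with the Ethics slider pushed to 0.50.
ETHICS_HEAVY = {"emissions": 0.20, "ethics": 0.50, "transparency": 0.10,
                "creativity": 0.10, "purpose": 0.10}

# Invented component scores (0-100, higher is better).
systems = {
    "System_A": {"emissions": 90, "ethics": 40, "transparency": 50,
                 "creativity": 50, "purpose": 50},
    "System_B": {"emissions": 40, "ethics": 90, "transparency": 50,
                 "creativity": 50, "purpose": 50},
}


def rank(systems, weights):
    """Return system names ordered from highest to lowest weighted score."""
    def total(name):
        return sum(weights[d] * systems[name][d] for d in weights)
    return sorted(systems, key=total, reverse=True)


print(rank(systems, DEFAULT_WEIGHTS))  # low-emission system leads
print(rank(systems, ETHICS_HEAVY))     # low-risk system leads
```

With the default weights System_A scores 61.5 against 56.5 for System_B. With the ethics-heavy weights the totals become 53 and 68, so the order flips.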

Total: 1.00

Full Rankings – all 40 AI Use Cases

#	Use Case	Score	Band	Purpose	Label

Score Distribution – How Spread Out Are the Scores?

Each bar represents a 5-point score band (e.g. 60–65). The height of the bar shows how many of the 40 AI systems fall into that band. A good framework should produce a spread-out distribution – if all scores were bunched together in the middle, the tool would not be discriminating enough to be useful.
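Grouping scores into 5-point bands can be sketched as follows. The boundary rule is an assumption: here a score of exactly 65 goes into the 65–70 band, and a score of 100 is placed in the top band (95–100). The input scores are invented.

```python
from collections import Counter


def score_bands(scores, width=5):
    """Count how many scores fall into each band, keyed by the band's lower edge."""
    counts = Counter()
    for s in scores:
        lower = int(s // width) * width
        lower = min(lower, 100 - width)  # keep a score of 100 in the top band
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical scores, for illustration only.
print(score_bands([3, 61, 64.9, 65, 100]))
```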
