AI in Public Safety Procurement: Questions Agencies Should Ask Before Buying

Jordan Mercer
2026-04-23
17 min read

A decision-maker’s checklist for buying AI in public safety—focused on oversight, bias, transparency, and vendor accountability.

Artificial intelligence is moving quickly from pilot projects into police departments, prosecutors’ offices, emergency management teams, and other public safety environments. That speed creates a procurement problem as much as a technology problem: agencies are often asked to decide whether a tool is useful before they have fully determined whether it is governable. A strong buying process should not start with a vendor demo; it should start with a disciplined review of risk, oversight, transparency, and operational fit. For agencies building a responsible acquisition process, our guide to human + AI workflows offers a useful mindset: AI should fit inside a controlled workflow, not replace the workflow itself.

The stakes are especially high in criminal justice because these tools can affect liberty, credibility, and public trust. A model that recommends, ranks, flags, or predicts can shape stops, searches, charging decisions, supervision, dispatch priorities, and investigative leads. If the system is opaque, poorly validated, or impossible to audit, the agency may inherit legal exposure and operational fragility alongside the software. That is why procurement teams should treat AI like any other high-risk government technology, but with extra caution on fairness, human review, and vendor accountability, much like the due diligence discussed in how to vet providers before you buy.

1. Start With the Use Case, Not the Hype

Define the decision the tool will support

Before evaluating vendors, agencies need a plain-language answer to one question: what decision or task is AI supposed to improve? A tool used to sort video evidence has a different risk profile than one used to prioritize subjects for investigation, and a system that automates administrative routing is not the same as a model that influences human liberty. Procurement should require a written use-case statement that spells out how the process works today, which pain point the agency is trying to solve, and where human review will remain mandatory. Without that clarity, agencies tend to buy “AI capability” instead of solving an actual operational problem.

Separate automation from decision support

One of the most common procurement mistakes is confusing efficiency with authority. AI can summarize, classify, alert, score, or suggest, but it should not be allowed to silently decide when the result has public safety implications. Agencies should ask whether the product is decision-support only, or whether it attempts to automate outcomes in a way that could change enforcement or legal status. As a practical reference point, the discipline used in evaluating AI coding assistants applies here too: the best tools amplify experts, but the organization must still own the final judgment.

Map where AI enters the workflow

Good procurement requires workflow mapping, not just product comparison. Agencies should identify where the AI starts, where humans review outputs, what happens when the model is wrong, and who is accountable for exceptions. That map should include escalation paths for contested cases and procedures for shutting the system off if the model behaves unpredictably. If the vendor cannot explain the workflow in operational terms, the agency has probably not found a mature product.
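As one way to make that map concrete, the sketch below (in Python, with entirely hypothetical step names and fields) records each workflow step, who performs it, and where contested cases escalate, then checks that no model output is acted on without a human step after it:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorkflowStep:
    name: str             # e.g., "triage incoming video evidence"
    actor: str            # "model" or "human"
    escalation_path: str  # who handles contested or failed cases
    can_halt_system: bool # whether this step may trigger stop-use

def model_steps_are_reviewed(steps: List[WorkflowStep]) -> bool:
    """Check that every model step is followed by a human step,
    so no AI output is acted on without review."""
    for i, step in enumerate(steps):
        if step.actor == "model":
            nxt = steps[i + 1] if i + 1 < len(steps) else None
            if nxt is None or nxt.actor != "human":
                return False
    return True

workflow = [
    WorkflowStep("rank investigative leads", "model", "unit supervisor", False),
    WorkflowStep("detective reviews ranked leads", "human", "unit supervisor", True),
]
assert model_steps_are_reviewed(workflow)
```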

2. Build a Public-Sector Risk Framework Before You Issue an RFP

Classify the system by risk level

Not every AI tool deserves the same level of scrutiny, but criminal justice tools generally sit in a higher-risk category because errors can have real-world consequences. Agencies should classify the system by how much it can affect rights, access, or enforcement priorities. A low-risk document sorter does not require the same controls as a risk-scoring or surveillance analytics product. This upfront classification should drive the procurement path, approval chain, training requirements, and audit obligations.
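A minimal sketch of how that classification might be encoded, with illustrative criteria and tiers rather than any official standard:

```python
def classify_risk(affects_liberty: bool, affects_enforcement_priority: bool,
                  processes_personal_data: bool) -> str:
    """Map a proposed AI system to a procurement risk tier.
    Criteria and tiers are illustrative; agencies should adopt
    thresholds from their own policy framework."""
    if affects_liberty:
        return "high"    # full review board, pilot, independent audit
    if affects_enforcement_priority or processes_personal_data:
        return "medium"  # governance requirements plus bias testing
    return "low"         # standard software procurement path

# A document sorter vs. a risk-scoring product land in different tiers.
print(classify_risk(False, False, False))  # low
print(classify_risk(True, True, True))     # high
```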

Set non-negotiable governance requirements

Before vendor outreach, agencies should define non-negotiable requirements in policy form. These often include human-in-the-loop review, audit logging, explainability thresholds, data minimization, cybersecurity controls, and a right to suspend use if the tool becomes noncompliant. If those requirements are only added during implementation, the procurement team may already be locked into a contract that undermines them. Agencies that are building a broader oversight program can borrow from the planning discipline in AI-powered moderation pipelines, where rules, logging, and escalation are designed before automation is scaled.

Form a cross-functional review team

Public safety procurement often fails when each team evaluates the product from a narrow lens. Legal may focus on liability, IT on integration, finance on price, and operations on speed, but no one may own the full risk picture. Agencies should form a review group that includes procurement, legal counsel, records management, cybersecurity, the relevant command staff, and a representative from the frontline workflow. The goal is not committee bloat; it is to ensure the final purchase reflects operational reality rather than a sales narrative.

3. Questions to Ask About Algorithmic Bias and Fairness

What data trained the model?

Bias discussions should begin with the data pipeline, because a model can only reflect the patterns it was trained on. Agencies should ask what datasets were used, from what jurisdictions or demographics, over what time periods, and whether the vendor can describe known gaps. If the system was trained on historical enforcement data, the agency must ask whether those records encode prior disparities in stops, arrests, or sentencing. This is not an abstract ethics issue; it is a forecasting issue that can affect accuracy, legitimacy, and constitutional exposure.

Has the vendor tested disparate impact?

Agencies should request concrete evidence of bias testing, not just a declaration that the product is “fair.” That means subgroup performance metrics, false positive and false negative rates, and testing across race, gender, age, geography, language, and other relevant categories when legally and operationally appropriate. The vendor should also explain how the model behaves when the data distribution changes, because bias often grows when a system is used outside its original population. For a useful analogy in a different context, see how AI ethics decisions in teen gaming show that safety claims matter less than measurable guardrails.
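To make “concrete evidence” tangible, here is a small sketch of the kind of subgroup analysis an agency can run on labeled validation data; the record layout is assumed for illustration:

```python
from collections import defaultdict

def subgroup_error_rates(records):
    """Compute false positive and false negative rates per subgroup.
    Each record is (group, predicted_flag, actual_flag); the data
    layout is illustrative, not a vendor-specific format."""
    fp = defaultdict(int); fn = defaultdict(int)
    neg = defaultdict(int); pos = defaultdict(int)
    for group, predicted, actual in records:
        if actual:
            pos[group] += 1
            if not predicted:
                fn[group] += 1
        else:
            neg[group] += 1
            if predicted:
                fp[group] += 1
    return {g: {"fpr": fp[g] / neg[g] if neg[g] else None,
                "fnr": fn[g] / pos[g] if pos[g] else None}
            for g in set(neg) | set(pos)}

rates = subgroup_error_rates([("A", True, False), ("A", False, False),
                              ("B", True, True), ("B", True, False)])
print(rates)  # compare FPR/FNR gaps across groups, not just overall accuracy
```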

What is the appeal and correction process?

Any system that affects a person’s status, risk level, or enforcement attention should have a clear review path. Agencies must know whether an individual can challenge an output, whether human reviewers can override the model, and whether the override is recorded for later audit. If a vendor cannot support a structured correction process, the agency may have bought a black box with no meaningful remedy. That is a governance failure, not just a technical limitation.

4. Transparency Is a Procurement Requirement, Not a Bonus Feature

Demand plain-language system documentation

Transparency begins with documentation that non-engineers can understand. Agencies should require a model card, system description, data source summary, intended-use statement, known limitations, version history, and deployment guidance. The documentation should explain how the tool was trained, what its confidence scores mean, and what circumstances are outside the model’s design envelope. If the vendor only offers sales collateral and a confidentiality clause, the agency should treat that as a warning sign.
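One lightweight way to enforce that requirement during evaluation is to treat the documentation itself as a checklist; the field names below are illustrative, not a formal model-card standard:

```python
# A minimal model-card checklist: require each field before acceptance.
# Field names are illustrative, not a standard schema.
REQUIRED_MODEL_CARD_FIELDS = [
    "intended_use", "out_of_scope_uses", "training_data_summary",
    "evaluation_metrics", "known_limitations", "version_history",
    "confidence_score_meaning", "deployment_guidance",
]

def missing_documentation(model_card: dict) -> list:
    """Return required fields the vendor has not supplied."""
    return [f for f in REQUIRED_MODEL_CARD_FIELDS
            if not model_card.get(f)]

vendor_card = {"intended_use": "rank tips for follow-up", "version_history": "v1.2"}
print(missing_documentation(vendor_card))  # gaps become contract action items
```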

Ask what can be explained to the public

Public safety agencies do not operate in a vacuum; they are accountable to the public, oversight bodies, and sometimes courts. Procurement teams should ask whether the vendor can support public-facing disclosures about tool purpose, safeguards, and governance. Even when some technical details are proprietary, the agency should be able to explain the system’s role in plain English, especially if a public records request or media inquiry arrives. Agencies that want a practical reference point for trust-building can look to the logic behind trust signals in the age of AI: credibility is built through visible evidence, not slogans.

Can outputs be audited after the fact?

If an AI tool cannot generate logs that show what it saw, what it produced, who reviewed it, and what happened next, oversight will be weak. Agencies should require audit trails that are time-stamped, immutable where possible, and retained according to records policy. Without that record, it becomes difficult to reconstruct errors, defend decisions, or identify patterns of misuse. The most defensible systems are the ones that leave a detailed trail.
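As a rough illustration of what “immutable where possible” can mean in practice, the sketch below chains each log entry to the previous one with a hash, so after-the-fact edits are detectable; a real deployment would add durable storage, access controls, and records-retention handling:

```python
import hashlib, json, time

class AuditLog:
    """Append-only log where each entry hashes the previous one,
    making later edits detectable. A sketch only, not production code."""
    def __init__(self):
        self.entries = []

    def record(self, model_input_ref, model_output, reviewer, action):
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"ts": time.time(), "input_ref": model_input_ref,
                "output": model_output, "reviewer": reviewer,
                "action": action, "prev": prev_hash}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)

    def verify(self) -> bool:
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev or e["hash"] != hashlib.sha256(
                    json.dumps(body, sort_keys=True).encode()).hexdigest():
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.record("case-1041/video-3", "match: 0.87", "Det. Rivera", "override: no action")
assert log.verify()
```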

5. Human Oversight: What It Should Actually Look Like

Define the human role in measurable terms

“Human oversight” is meaningless unless it is operationalized. Agencies should specify whether the human reviewer is a checker, approver, investigator, or decision-maker, and what level of review is required before the output can be acted on. A superficial review, where staff are expected to rubber-stamp a recommendation under time pressure, offers little real protection. The oversight standard should be strong enough that a trained person can detect obvious errors, question edge cases, and stop inappropriate use.

Set staffing and training thresholds

Oversight only works if staff understand the tool’s limitations. Agencies need training on false confidence, automation bias, model drift, and when not to rely on the output. They should also budget for staffing, because a tool that requires review but has no reviewer capacity creates pressure to shortcut the process. In other words, AI procurement should be treated like an operations change, not just a software purchase; the same principle appears in how tech companies maintain trust during outages, where resilience depends on process, not promises.

Create stop-use triggers

Agencies should establish the conditions under which AI use is paused, restricted, or terminated. Those triggers may include significant error rates, unexplained drift, data access violations, cybersecurity incidents, or evidence of discriminatory impact. A mature vendor will not resist this; it should welcome it, because a stop-use clause shows the agency is serious about governance. The ability to stop use is not an admission of failure; it is a basic control in a high-risk environment.
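A stop-use policy can be expressed as simple, testable rules. The thresholds below are placeholders an agency would set in policy, not recommendations:

```python
def should_pause(metrics: dict, thresholds: dict) -> list:
    """Return the list of tripped stop-use triggers.
    Thresholds are policy choices, not recommendations."""
    tripped = []
    if metrics["error_rate"] > thresholds["max_error_rate"]:
        tripped.append("error rate above policy ceiling")
    if metrics["override_rate"] > thresholds["max_override_rate"]:
        tripped.append("reviewers overriding the model unusually often")
    if metrics["security_incidents"] > 0:
        tripped.append("open cybersecurity incident")
    if metrics["disparity_gap"] > thresholds["max_disparity_gap"]:
        tripped.append("subgroup error gap exceeds tolerance")
    return tripped

alerts = should_pause(
    {"error_rate": 0.08, "override_rate": 0.35,
     "security_incidents": 0, "disparity_gap": 0.02},
    {"max_error_rate": 0.05, "max_override_rate": 0.25,
     "max_disparity_gap": 0.10},
)
if alerts:
    print("Pause deployment:", alerts)
```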

6. Vendor Accountability: What Should Be in the Contract

Ownership of model updates and change notices

Contracts must address what happens when the vendor changes the model, updates the training data, or modifies the interface. Agencies should require advance notice of material changes, regression testing after updates, and the right to reject a new version that degrades performance or transparency. This matters because a product may be safe and well-understood at procurement, then change substantially a year later without adequate warning. Agencies that manage software upgrades carefully can borrow from the mindset in best practices for update pitfalls, where testing is part of governance rather than an afterthought.

Indemnification, warranties, and performance claims

Contract language should not rely on vague performance assurances. Agencies should ask what the vendor is warranting, what remedies exist if the product fails, and whether the vendor will indemnify the agency for IP, privacy, or data-processing violations where appropriate. At the same time, agencies should be realistic: many vendors will resist broad liability. That makes documentation, validation, and pilot testing even more important, because a weak contract combined with a weak evidence base leaves the agency exposed.

Data ownership and secondary use restrictions

Agencies must know who owns input data, derived data, annotations, logs, and model outputs. They should also prohibit the vendor from using agency data to train other customers’ systems unless there is a clear, reviewed policy basis and legal authority. In criminal justice, secondary use can create privacy, due process, and public trust problems even if it is technically allowed under a standard SaaS contract. Data restrictions should be written in explicit, auditable language.

7. Compare Vendors With a Structured Scorecard

Procurement teams often compare AI vendors by feature list and price, but that approach misses the governance factors that determine whether a tool can survive scrutiny. A better method is a structured scorecard that weights legality, transparency, oversight support, and operational reliability alongside core functionality. Agencies should score vendors consistently, document the rationale, and keep the records for audit and procurement review. If your team is building a disciplined sourcing process, the approach in how to vet an equipment dealer before you buy is a helpful template: good buyers ask hard questions before they sign.

| Evaluation Area | What to Ask | Strong Answer Looks Like | Red Flag |
| --- | --- | --- | --- |
| Use case fit | What exact task does the tool support? | Specific workflow, clear boundaries | Broad “transformative” claims |
| Bias testing | What subgroup testing has been done? | Published metrics and methodology | No testing or NDA-only claims |
| Human oversight | What must a human review before action? | Defined review steps and authority | “Reviewer as needed” language |
| Transparency | What documentation and logs are provided? | Model card, audit logs, change notices | Black-box outputs only |
| Vendor accountability | How are updates, errors, and remedies handled? | Contractual notices, regression tests, remedies | No update controls |
| Data governance | Can data be reused or retained by vendor? | Explicit limits and deletion rights | Open-ended reuse rights |

Weight governance over flashy demos

Many AI products look impressive in controlled demos because the vendor has curated the examples. A scorecard forces the agency to measure what matters in real use: reliability, explainability, and the ability to support oversight under normal operating pressure. The best procurement teams assign heavier weights to issues like bias, privacy, auditability, and lifecycle support than to presentation polish. That keeps the decision grounded in public value rather than marketing confidence.
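A minimal sketch of governance-weighted scoring, assuming hypothetical weights and 0-5 ratings per area; the point is that the weights, not the demo, decide the ranking:

```python
# Illustrative weights: governance criteria outweigh demo polish.
WEIGHTS = {"use_case_fit": 0.15, "bias_testing": 0.20,
           "human_oversight": 0.20, "transparency": 0.15,
           "vendor_accountability": 0.15, "data_governance": 0.15}

def score_vendor(ratings: dict) -> float:
    """Weighted score from 0-5 ratings per evaluation area.
    Weights are hypothetical and should come from agency policy."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[area] * ratings[area] for area in WEIGHTS)

vendor_a = {"use_case_fit": 4, "bias_testing": 2, "human_oversight": 3,
            "transparency": 2, "vendor_accountability": 3, "data_governance": 4}
print(round(score_vendor(vendor_a), 2))  # document the rationale alongside the score
```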

Document why the winner won

Every selection should be explainable after the fact. The record should show why the chosen system was preferable, what tradeoffs were accepted, and what safeguards were put in place to mitigate those tradeoffs. That documentation helps if the contract is challenged, if the press asks questions, or if a future administration needs to understand the rationale. Good records are not bureaucracy; they are institutional memory.

8. Pilot Programs Should Be Designed Like Experiments

Set measurable success criteria

Agencies should not call something a pilot unless they have a hypothesis to test. Success metrics might include reduced turnaround time, fewer manual errors, higher investigator throughput, or improved triage consistency. But the pilot should also measure unintended consequences, such as over-reliance, discrepancy rates, and workload shifts to other staff. An honest pilot gives the agency evidence, not just a sense of momentum.

Include a control group or baseline where possible

Without a baseline, it is difficult to know whether AI truly improved outcomes. Agencies should compare AI-assisted workflows against prior performance, and when feasible, use parallel testing with human-only review. That comparison helps isolate whether the tool is genuinely better or simply faster. It is especially important in criminal justice, where speed alone can hide error.
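A small sketch of what that comparison can look like, using assumed metrics (hours per case and error counts); the shape of the comparison matters more than the specific numbers:

```python
from statistics import mean

def pilot_comparison(baseline_times, assisted_times,
                     baseline_errors, assisted_errors):
    """Compare the AI-assisted workflow against the human-only baseline.
    Units and metrics are illustrative (hours per case, error flags)."""
    return {
        "avg_time_change": mean(assisted_times) - mean(baseline_times),
        "error_rate_change": (sum(assisted_errors) / len(assisted_errors)
                              - sum(baseline_errors) / len(baseline_errors)),
    }

result = pilot_comparison(
    baseline_times=[5.1, 4.8, 6.0], assisted_times=[3.2, 3.5, 4.1],
    baseline_errors=[0, 1, 0, 0], assisted_errors=[0, 0, 1, 1],
)
print(result)  # faster is not better if the error rate went up
```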

Limit scope before scale

Start with a narrow use case, a limited dataset, and a defined set of users. That reduces the chance that problems spread before the agency understands them. Only after the pilot shows stable results should the agency expand the tool into additional units or higher-risk decisions. This incremental approach mirrors the caution shown in AI viability evaluations, where the question is not whether the tool works once, but whether it works reliably in the environments that matter.

9. What Good Oversight Looks Like After Deployment

Monitor drift, complaints, and exceptions

Deployment is not the end of procurement. Agencies should monitor performance over time, including drift in accuracy, spikes in overrides, complaints from staff or the public, and distribution changes in the data. If the model starts behaving differently as conditions change, the agency needs a plan to investigate and correct it. Ongoing monitoring should be built into the contract and into internal responsibility assignments.
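One inexpensive monitoring signal is a spike in human overrides relative to the recent trailing average; the sketch below assumes weekly counts and illustrative tuning values:

```python
def override_rate_alert(weekly_overrides, weekly_outputs, window=4, factor=2.0):
    """Flag weeks where the human override rate jumps well above the
    trailing average, a cheap proxy for drift. Window and factor
    are illustrative tuning choices, not standards."""
    rates = [o / n for o, n in zip(weekly_overrides, weekly_outputs)]
    alerts = []
    for i in range(window, len(rates)):
        trailing = sum(rates[i - window:i]) / window
        if trailing > 0 and rates[i] > factor * trailing:
            alerts.append((i, rates[i]))
    return alerts

# Override spikes often surface drift before accuracy metrics do.
print(override_rate_alert(
    weekly_overrides=[2, 3, 2, 3, 9], weekly_outputs=[100] * 5))
```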

Schedule periodic independent review

High-risk AI tools should be reviewed on a recurring basis by a team that is not the same group that procured or operates the system. That independence matters because internal teams can become normalized to a tool’s flaws. Reviewers should test whether the system still matches the original use case, whether governance controls are working, and whether the system remains legally defensible. Independent review is one of the clearest signs that the agency takes accountability seriously.

Prepare for public records and oversight questions

Any public safety AI program should assume it will eventually be scrutinized by elected officials, oversight boards, litigants, journalists, or community groups. Agencies should have ready answers about purpose, testing, safeguards, complaint handling, and the role of humans in the loop. The public does not need every technical detail, but it does need confidence that the agency understands its own system. That confidence is earned by process, not by rhetoric.

10. A Practical Checklist Agencies Can Use Before Buying

The decision-maker’s questions

Before approving an AI procurement, agencies should be able to answer the following: What exact problem is the tool solving? What decisions will it influence? Where is human review mandatory? What data trained the model? What bias testing has been done? What logs, documentation, and change notices will the vendor provide? What happens when the tool makes a mistake? What is the stop-use trigger? If the agency cannot answer these questions in writing, it is not ready to buy.

Procurement should require evidence, not assurance

Vendor presentations are designed to persuade; procurement is supposed to evaluate. Agencies should insist on proof: test results, technical documentation, sample logs, references from comparable public-sector users, and a contract that preserves the right to audit and suspend. This is especially important when the product uses sensitive data or may influence enforcement decisions. For another example of buying with discipline, see provider vetting guidance, where the central lesson is to verify before committing.

Build the checklist into policy

The most durable approach is to convert the checklist into a standard procurement policy. That means every department uses the same review questions, every high-risk vendor faces the same documentation requirements, and every approval is recorded. Once the checklist becomes policy, the agency is less vulnerable to rushed purchases, personality-driven exceptions, and post-hoc rationalizations. In a field as consequential as public safety, consistency is a form of fairness.

Pro Tip: If a vendor cannot clearly explain how their system would be audited after a mistaken stop, a false lead, or a disputed flag, the agency should pause the procurement. In public safety, the cost of a bad answer is not just wasted money; it can be damage to trust, legitimacy, and individual rights.

Frequently Asked Questions

What is the biggest mistake agencies make when buying AI for public safety?

The biggest mistake is buying a tool before defining the use case and risk level. Agencies often focus on features or vendor reputation instead of the decision the tool will influence, the human review required, and the consequences of error. That leads to contracts that are hard to govern and even harder to defend.

Should AI ever make final decisions in criminal justice?

For high-stakes criminal justice uses, agencies should be extremely cautious about final automated decisions. The safest approach is usually human decision-making with AI as support, not replacement. If a system meaningfully affects liberty, enforcement, or case outcomes, human review and accountability should remain central.

What evidence should a vendor provide on bias?

At minimum, the vendor should provide subgroup performance data, methodology, validation results, and information about the training data. Agencies should look for false positive and false negative rates across relevant groups and ask how the model performs when conditions change. A claim of fairness without metrics is not enough.

Why are audit logs so important?

Audit logs make it possible to reconstruct what the system saw, what it produced, who reviewed it, and what action followed. Without logs, agencies cannot investigate errors, respond to complaints, or prove that human oversight actually occurred. Logs are essential for accountability, transparency, and legal defense.

How can small agencies manage AI procurement with limited staff?

Small agencies should narrow the scope of use cases, standardize the checklist, and insist on simpler systems that are easier to document and audit. They can also seek shared services, external technical review, or procurement templates developed by peer agencies. The goal is not to do everything alone; it is to avoid buying complexity the agency cannot manage.

What should trigger a pause in deployment?

Common stop-use triggers include significant error rates, unexplained drift, cybersecurity incidents, failure to provide promised logs or notices, and evidence of disparate impact or misuse. Agencies should define these triggers before launch, not after a problem emerges. That way, pausing the system is a governance action rather than a crisis response.


Related Topics

#AI #government #procurement #ethics

Jordan Mercer

Senior Public Affairs Editor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
