De-identify and protect sensitive data | Skyflow

De-identification locates sensitive information such as PII, PHI, and PCI data in text, PDFs, images, and audio files, then replaces that information with tokens. You can use the de-identified data for tasks such as model training, fine-tuning, secure inference, privacy-aware analytics, or any other operational task that requires de-identified unstructured data. De-identification can also store the detected sensitive information in the vault so you can operate on it later, for example by using Skyflow’s advanced governance capabilities to control how users access the sensitive data or re-identify the associated tokens. You can de-identify data through APIs for discrete transactions or through batch workflows with Pipelines.
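As a rough illustration of the flow described above (detect, replace with tokens, store the originals, selectively re-identify), here is a minimal local sketch in Python. It does not call Skyflow. The regex patterns, the in-memory `vault` dictionary, the token shape, and the `deidentify` and `reidentify` functions are all hypothetical stand-ins chosen for illustration.

```python
import re
import uuid

# Hypothetical patterns for illustration only; a real detector covers many
# more entity types and does not rely on regexes alone.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def deidentify(text, vault):
    """Replace each detected entity with a token and record it in `vault`.

    `vault` is a plain dict standing in for secure storage (an assumption):
    it maps token -> (entity type, original value).
    """
    for entity, pattern in PATTERNS.items():
        def swap(match, entity=entity):
            token = f"[{entity}_{uuid.uuid4().hex[:8].upper()}]"
            vault[token] = (entity, match.group(0))
            return token

        text = pattern.sub(swap, text)
    return text


def reidentify(text, vault, allowed_entities):
    """Restore original values only for entity types the caller may see."""
    for token, (entity, value) in vault.items():
        if entity in allowed_entities:
            text = text.replace(token, value)
    return text


vault = {}
original = "Contact jane@example.com, SSN 123-45-6789."
masked = deidentify(original, vault)

# A caller allowed to see only email addresses gets a partial restoration.
partial = reidentify(masked, vault, allowed_entities={"EMAIL"})
```

In this sketch the access decision is a simple set of allowed entity types passed by the caller. A production system would derive that decision from its own governance policies rather than from a function argument.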

Key capabilities

With de-identification, you can make full use of your data without exposing sensitive details. Below are a few ways you can leverage de-identification in your organization.

Many languages and entities supported. De-identify a broad range of personally identifiable information including names, addresses, and credit card numbers across more than 40 languages with support for 60 entity types.
Work with various file formats and media types, such as PDFs, CSVs, DOCs, images, and audio.
Extend de-identification capabilities to domain-specific data by defining regex patterns that meet your business needs.
Custom redaction/tokenization options. Customize how Skyflow masks or redacts sensitive entities in your data. Choose vault tokens (like NAME_ABC123), entity-unique counters, or entity-only tokens. For vault tokens, you can also select format-preserving tokens, UUIDs, or numeric combinations.
Re-identifying redacted unstructured data. Apply role-based controls to define who can see masked, redacted, or plaintext data, so that access to sensitive data follows your privacy policies and compliance requirements.
Guardrails for AI interactions. Control LLM inputs and outputs to keep your AI interactions safe, compliant, and on-topic. Block toxic language with toxicity checking and restrict certain subjects with denied topics. Guardrails help prevent inappropriate content, prompt injection attacks, and off-topic responses before processing sensitive data.
Ingest data in real time or in batches. Ingest through APIs or periodically in large batches using Pipelines.
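To make the custom regex and token options above more concrete, the following Python sketch applies a user-defined pattern and renders matches in three token styles. It is a local illustration, not Skyflow's implementation: the `EMPLOYEE_ID` pattern, the `redact` function, the style names, and the bracketed token shapes are assumptions. In particular, it assumes that an entity-unique counter gives repeated occurrences of the same value the same number.

```python
import re
import uuid
from collections import defaultdict

# Hypothetical domain-specific entity defined by a regex pattern.
CUSTOM_PATTERNS = {"EMPLOYEE_ID": re.compile(r"\bEMP-\d{6}\b")}


def redact(text, style="vault"):
    """Replace matches of each custom pattern using the chosen token style.

    Styles (names are assumptions for this sketch):
      "entity_only" -> [EMPLOYEE_ID]
      "counter"     -> [EMPLOYEE_ID_1], [EMPLOYEE_ID_2], ... per distinct value
      "vault"       -> [EMPLOYEE_ID_3FA9C2], a random opaque suffix
    """
    if style not in ("entity_only", "counter", "vault"):
        raise ValueError(f"unknown token style: {style}")

    counters = defaultdict(dict)  # entity type -> {value: counter index}

    def make_token(entity, value):
        if style == "entity_only":
            return f"[{entity}]"
        if style == "counter":
            seen = counters[entity]
            # Reuse the index for a repeated value, otherwise assign the next.
            index = seen.setdefault(value, len(seen) + 1)
            return f"[{entity}_{index}]"
        return f"[{entity}_{uuid.uuid4().hex[:6].upper()}]"

    for entity, pattern in CUSTOM_PATTERNS.items():
        text = pattern.sub(
            lambda m, e=entity: make_token(e, m.group(0)), text
        )
    return text


sample = "EMP-000123 approved EMP-000456; EMP-000123 signed."
entity_only = redact(sample, "entity_only")
counted = redact(sample, "counter")
opaque = redact(sample, "vault")
```

Entity-only tokens hide the most but lose the ability to tell two people apart. Counters keep that distinction within a document, and opaque tokens are the style that can be mapped back to a stored value.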

Supported data formats

De-identification supports a variety of data formats.

Data Type	Supported Operations	Details
Text	De-identification & re-identification	De-identify text files and plain text, optionally storing the detected sensitive data for later use.
Files	De-identification & re-identification	Scan files such as CSV, DOC, and PPT to de-identify sensitive data, optionally storing the detected sensitive data for later use.
PDF	Redaction	Redact sensitive data from PDFs. Optionally adjust PDF density and resolution for better quality.
Audio	De-identification (Audio-to-Text)	Transcribe spoken content (like call recordings and voice messages) into plain or diarized transcripts, then de-identify or redact sensitive segments in the transcript.
Audio	Redaction (Audio-to-Audio)	Redact sensitive segments directly in the audio file using techniques like bleeping, silencing, or voice morphing.
Images	Redaction	Use OCR to find and redact text, and detect and redact faces, logos, and both handwritten and digital signatures.

Next steps

Now that you understand Skyflow’s de-identification capabilities at a high level, see the following content to learn how to integrate and customize de-identification in your processes:

De-identify a string

De-identify sensitive data in a string.

Re-identify a string

Selectively re-identify data in a string.

De-identify a file

De-identify sensitive data in a file.

Add guardrails to AI interactions

Block toxic inputs and restrict topics in your LLM workflows.

Detect API reference

Comprehensive usage details for the Detect API.