De-identify and protect sensitive data

De-identification locates sensitive information such as PII, PHI, and PCI in text, PDFs, images, and audio files, and then replaces that information with tokens. You can use the de-identified data for tasks such as model training, fine-tuning, secure inference, privacy-aware analytics, or any other operational task that requires de-identified unstructured data. De-identification can also store the detected sensitive information in the vault so that you can operate on it later, for example by using Skyflow’s advanced governance capabilities to control how users access the sensitive data or re-identify the associated tokens. You can de-identify your data through APIs for discrete transactions or through batch workflows with Pipelines.

Key capabilities

With de-identification, you can make full use of your data without exposing sensitive details. Below are a few ways you can use de-identification in your organization.

  • Many languages and entities supported. De-identify a broad range of personally identifiable information including names, addresses, and credit card numbers across more than 40 languages with support for 60 entity types.

  • Work with various file formats and media types, such as PDFs, CSVs, DOCs, images, and audio.

  • Extend de-identification capabilities to domain-specific data by defining regex patterns that meet your business needs.

  • Custom redaction and tokenization options. Customize how Skyflow masks or redacts sensitive entities in your data. Choose vault tokens (like NAME_ABC123), entity-unique counters, or entity-only tokens. For vault tokens, you can also select format-preserving tokens, UUIDs, or numeric combinations.

  • Re-identify redacted unstructured data. Apply role-based controls to define who can see masked, redacted, or plaintext data, helping you enforce privacy policies and maintain compliance at all levels.

  • Guardrails. Specify input and output controls to keep your LLM on-topic. Control the types of information and interactions that your LLM can process based on your use case and end user.

  • Ingest data in real time or in batches. Ingest through APIs or periodically in large batches using Pipelines.
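To make the custom-pattern and token-style capabilities above more concrete, here is a minimal local sketch in Python. It does not call Skyflow and is not Skyflow's API: the entity names, the regex patterns, the bracketed token format, and the `deidentify` and `reidentify` helpers are all hypothetical and chosen only for illustration. It shows regex-based detection of domain-specific entities, the difference between entity-only tokens and entity-unique counters, and how a stored token-to-value mapping makes re-identification possible.

```python
import re
import uuid
from collections import defaultdict

# Hypothetical domain-specific patterns; real patterns depend on your data.
PATTERNS = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{6}\b"),
    "CASE_NUMBER": re.compile(r"\bCASE-[A-Z]{2}\d{4}\b"),
}

TOKEN_STYLES = ("entity_only", "entity_counter", "uuid")


def deidentify(text, token_style="entity_counter"):
    """Replace each regex match with a token.

    Returns (redacted_text, mapping), where mapping is token -> original value.
    The mapping stands in for a vault in this sketch; it stays empty for
    "entity_only" because those tokens cannot be traced back to a value.
    """
    if token_style not in TOKEN_STYLES:
        raise ValueError(f"unknown token_style: {token_style!r}")

    counters = defaultdict(int)  # per-entity counter for "entity_counter"
    seen = {}                    # (entity, value) -> token, so repeats reuse a token
    mapping = {}                 # token -> original value

    def make_token(entity, value):
        if token_style == "entity_only":
            return f"[{entity}]"
        key = (entity, value)
        if key not in seen:
            if token_style == "entity_counter":
                counters[entity] += 1
                suffix = str(counters[entity])
            else:  # "uuid": a random identifier instead of a counter
                suffix = uuid.uuid4().hex[:8]
            seen[key] = f"[{entity}_{suffix}]"
            mapping[seen[key]] = value
        return seen[key]

    for entity, pattern in PATTERNS.items():
        # Bind entity as a default argument so each lambda keeps its own name.
        text = pattern.sub(lambda m, e=entity: make_token(e, m.group(0)), text)
    return text, mapping


def reidentify(text, mapping):
    """Restore original values for every token present in the mapping."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


if __name__ == "__main__":
    sample = "EMP-123456 opened CASE-AB1234; EMP-123456 closed it."
    redacted, vault = deidentify(sample)
    print(redacted)  # [EMPLOYEE_ID_1] opened [CASE_NUMBER_1]; [EMPLOYEE_ID_1] closed it.
    print(reidentify(redacted, vault) == sample)  # True
```

In this sketch, the counter and UUID styles give a repeated value the same token, so references to the same entity stay linked in the de-identified text. Entity-only tokens discard that link, and with it any way to re-identify the value.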

Supported data formats

De-identification supports a variety of data formats.

Data Type | Supported Operations | Details
Text | De-identification & re-identification | De-identify text files and plain text, optionally storing de-identified sensitive data for later use.
Files | De-identification & re-identification | Scans files such as CSV, DOC, and PPT to de-identify sensitive data, optionally storing the de-identified sensitive data for later use.
PDF | Redaction | Redact sensitive data from PDFs. Optionally adjust PDF density and resolution for better quality.
Audio | De-identification (Audio-to-Text) | Transcribes spoken content (like call recordings and voice messages) into plain or diarized transcripts. Supports de-identification or redaction of sensitive segments in the transcript.
Audio | Redaction (Audio-to-Audio) | Redacts sensitive segments directly in the audio file using techniques like bleeping, silencing, or voice morphing.
Images | Redaction | Uses OCR to redact text, find and redact faces, logos, and both handwritten and digital signatures.

Next steps

Now that you understand Skyflow’s de-identification capabilities at a high level, see the following content for how to integrate and customize de-identification in your processes: