De-identify and protect sensitive data
De-identification locates sensitive information such as PII, PHI, and PCI in text, PDFs, images, and audio files, and then replaces that information with tokens. You can use the de-identified data for tasks such as model training, fine-tuning, secure inference, privacy-aware analytics, or any other operational task that requires de-identified unstructured data. De-identifying your data also lets you protect the detected sensitive information within the vault and operate on it later, for example by using Skyflow’s advanced governance capabilities to control how users access the sensitive data or re-identify the associated tokens. You can de-identify your data through APIs for discrete transactions or through batch workflows with Pipelines.
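To make the detect, tokenize, and re-identify cycle concrete, here is a minimal local sketch in Python. It is not Skyflow's API: the `PATTERNS` table, the `deidentify` and `reidentify` helpers, and the in-memory `vault` dictionary are all hypothetical stand-ins, and a production detector would typically rely on trained models rather than two regular expressions.

```python
import re
import uuid

# Hypothetical detection rules, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def deidentify(text, vault):
    """Replace each detected entity with a token and store the original in `vault`."""
    for entity, pattern in PATTERNS.items():
        def swap(match, entity=entity):
            # Token shaped like ENTITY_ABC123; the random suffix is an assumption.
            token = f"{entity}_{uuid.uuid4().hex[:6].upper()}"
            vault[token] = match.group(0)
            return token

        text = pattern.sub(swap, text)
    return text


def reidentify(text, vault):
    """Swap every known token back to its original value."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text


if __name__ == "__main__":
    vault = {}  # stands in for a real vault; a plain dict offers no protection
    masked = deidentify("Contact jane@example.com, SSN 123-45-6789.", vault)
    print(masked)
    print(reidentify(masked, vault))
```

In this sketch the de-identified text is safe to pass downstream, while the mapping needed to restore it stays in one place. In a real deployment, access to that mapping would be governed by policy rather than open to any caller.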
Key capabilities
With de-identification, you can make full use of your data without exposing sensitive details. Below are a few ways you can leverage de-identification in your organization.
- Many languages and entities supported. De-identify a broad range of personally identifiable information, including names, addresses, and credit card numbers, across more than 40 languages, with support for 60 entity types.
- Work with various file formats and media types, such as PDFs, CSVs, DOCs, images, and audio.
- Extend de-identification capabilities to domain-specific data by defining regex patterns that meet your business needs.
- Custom redaction/tokenization options. Customize how Skyflow masks or redacts sensitive entities in your data. Choose vault tokens (like NAME_ABC123), entity-unique counters, or entity-only tokens. For vault tokens, you can also select from format-preserving tokens, UUIDs, or numeric combinations.
- Re-identifying redacted unstructured data. Apply role-based controls to define who can see masked, redacted, or plaintext data, which helps ensure compliance with privacy policies at all levels.
- Guardrails. Specify input and output controls to keep your LLM on-topic. Control the types of information and interactions that your LLM can process based on your use case and end user.
- Ingest data in real time or in batches. Ingest in real time through APIs, or periodically in large batches using Pipelines.
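The custom regex patterns and token options listed above can be illustrated with a short Python sketch. Everything here is an assumption made for illustration rather than Skyflow's configuration format: the `EMPLOYEE_ID` entity and its pattern, the `redact` helper, the style names, and the bracketed output for entity-only and counter tokens. Only the `NAME_ABC123` shape of a vault token comes from the text above.

```python
import re
import secrets
from collections import defaultdict

# Hypothetical custom entity: an internal employee ID such as "EMP-00421".
CUSTOM_PATTERNS = {"EMPLOYEE_ID": re.compile(r"\bEMP-\d{5}\b")}


def redact(text, patterns, style="entity_only"):
    """Replace matches with tokens in one of three illustrative styles."""
    counters = defaultdict(int)  # last counter issued per entity type
    seen = {}                    # (entity, value) -> token, so repeats share a token

    def token_for(entity, value):
        key = (entity, value)
        if key not in seen:
            if style == "entity_only":
                seen[key] = f"[{entity}]"
            elif style == "entity_counter":
                counters[entity] += 1
                seen[key] = f"[{entity}_{counters[entity]}]"
            elif style == "vault_token":
                # Random six-character suffix, shaped like NAME_ABC123.
                seen[key] = f"{entity}_{secrets.token_hex(3).upper()}"
            else:
                raise ValueError(f"unknown style: {style}")
        return seen[key]

    for entity, pattern in patterns.items():
        text = pattern.sub(lambda m, e=entity: token_for(e, m.group(0)), text)
    return text


if __name__ == "__main__":
    sample = "EMP-00421 approved the request from EMP-00977; EMP-00421 signed off."
    for style in ("entity_only", "entity_counter", "vault_token"):
        print(style, "->", redact(sample, CUSTOM_PATTERNS, style))
```

The styles trade privacy against utility. Entity-only tokens reveal nothing beyond the entity type. Counters additionally show that the first and third IDs in the sample refer to the same person. Only vault-style tokens are distinct enough to be mapped back to an original value later.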
Supported data formats
De-identification supports a variety of data formats.
Next steps
Now that you understand Skyflow’s de-identification capabilities at a high level, see the following content to learn how to integrate and customize de-identification in your processes: