AUTHOR
Julien Delange, Founder and CEO
Julien is the CEO of Codiga. Before starting Codiga, Julien was a software engineer at Twitter and Amazon Web Services.
Julien has a PhD in computer science from Universite Pierre et Marie Curie in Paris, France.
What is serialization and deserialization? How does it work in Python?
Serialization transforms an object into a byte stream. Deserialization is the inverse process: taking a byte stream to create an object.
Objects are saved to and restored from byte streams so that they can be exchanged through the filesystem or the network. For example, in a distributed system, the server may receive objects from a client (or the other way around: the server can send objects back to the client).
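As a minimal sketch of the round trip described above, the snippet below serializes a dictionary to bytes with the standard pickle module and restores it. The dictionary contents and variable names are made up for illustration.

```python
import pickle

# Illustrative data only; any picklable object would work here.
original = {"sender": "client", "body": "hello", "attempts": 3}

# Serialization: object -> byte stream.
payload = pickle.dumps(original)

# Deserialization: byte stream -> a new object equal to the original.
restored = pickle.loads(payload)
```

The restored object is equal to the original but is a separate copy, which is what lets the byte stream travel through a file or a socket in between.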
Why is deserialization unsafe?
Deserialization is unsafe for one simple reason: you cannot trust the binary value passed to you. While there is no problem with serializing data (e.g., sending it to someone else), deserialization takes an arbitrary byte stream and converts it to an object. There is no guarantee that the resulting object is safe to use, and the byte stream can include code that may compromise your system.
Unsafe deserialization is a common software weakness. MITRE, in their Common Weakness Enumeration (CWE) system, references it under CWE-502: Deserialization of Untrusted Data.
This blog post illustrates how unsafe deserialization works with Python and the standard pickle module.
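To make the risk concrete, here is a harmless sketch of the mechanism. pickle lets a class define __reduce__, which tells the unpickler which callable to invoke and with which arguments. The Exploit class name and the "6 * 7" expression are illustrative only; an attacker would typically substitute something like os.system and a shell command.

```python
import pickle


class Exploit:
    """Hypothetical attacker-controlled class, for illustration only."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A real attack would name a dangerous callable instead.
        return (eval, ("6 * 7",))


# The attacker builds the payload on their own machine.
payload = pickle.dumps(Exploit())

# The victim only calls loads(), yet the embedded callable runs.
result = pickle.loads(payload)
```

Here loads() does not return an Exploit instance at all: it returns whatever the embedded call produced, which shows that the byte stream, not the receiving program, decides what code runs.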
What Python modules are vulnerable to unsafe deserialization?
While it's hard to enumerate all Python modules that serialize/deserialize data, the most used are:
- pickle (from the Python standard distribution)
- pandas (with read_pickle())
- shelve (with open)
Note that the documentation of all these modules mentions security concerns and warns developers to deserialize data only from trusted sources. In very specific cases, it might be safe to deserialize data (e.g., when loading data you previously saved on your local machine). In the vast majority of cases, it is unsafe and strongly discouraged.
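For the narrow case of reloading data you saved yourself, one possible safeguard is to sign the pickled bytes with an HMAC and verify the signature before unpickling. The sketch below is an assumption-laden illustration, not something the original post prescribes: SECRET_KEY, dump_signed, and load_signed are hypothetical names, and the approach only holds if the key stays secret.

```python
import hashlib
import hmac
import pickle

# Hypothetical key for illustration; a real key belongs in a secret store.
SECRET_KEY = b"replace-with-a-real-secret"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dump_signed(obj) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload


def load_signed(blob: bytes):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

This only proves the bytes were produced by someone holding the key. It does not make pickle safe for data from other parties.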
How to safely serialize and deserialize data?
Unfortunately, there is no silver bullet. The safest approach is not to rely on object deserialization at all and instead to use an API that exchanges only the data you need.
If you need to communicate data in a binary format (for performance reasons), binary protocols such as protobuf or Thrift are more secure and appropriate.
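As one way to exchange only the data you need, the sketch below uses JSON (an assumption; the post does not name a specific format) and validates each field explicitly before building an object. The User type and the function names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical record type used only for this example.
    name: str
    age: int


def user_to_json(user: User) -> str:
    """Send plain data fields, never a serialized Python object."""
    return json.dumps({"name": user.name, "age": user.age})


def user_from_json(text: str) -> User:
    """Parse JSON and check every field before constructing a User."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name, age = data.get("name"), data.get("age")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("'age' must be an integer")
    return User(name=name, age=age)
```

Unlike pickle, json.loads only produces basic types (dicts, lists, strings, numbers, booleans, None), so the receiving code decides which objects get built and how.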
Automatically detect unsafe deserialization
Codiga provides IDE plugins and integrations with GitHub, GitLab, or Bitbucket to detect unsafe deserialization for multiple Python modules (pickle, shelve, pandas).
Codiga's static code analysis detects unsafe deserialization in your IDE or code reviews through a dedicated rule. This rule covers the following Python modules: pickle, shelve, and pandas.
To use this rule consistently, install the integration in your IDE (VS Code or JetBrains) or code management system and add a codiga.yml file at the root of your project with the following content:
rulesets:
  - python-security
Codiga will then check all your Python code against 100+ rules that detect unsafe and insecure code and suggest fixes for each issue found.