Open-source file fingerprint (checksum) database system
Is there an open-source tool (whether polished or experimental) that can
help us store, archive, and manage file fingerprints (e.g. a set of file
sizes, SHA1 sums, and optionally filenames and mtimes) for archival
purposes? A few use cases:
1) managing a collection of large files (notably photos, videos, or audio):
These files rarely change after they are created. I often keep duplicates
of photos on various drives, but now and then I want to clean up those
drives. When cleaning a drive of its files, I want to be sure every
deletion targets a real duplicate, not a file that is unique to that
drive.
2) data de-duplication, or managing a collection of old (deleted) files:
when deleting old files from a storage medium, I often keep at least one
duplicate elsewhere as a safety backup while I am cleaning those files.
However, because of the sheer amount (or scattering) of the data, I often
forget whether these files have already been cleaned up (i.e. whether they
are eligible for permanent deletion). With this database system, I could
store the fingerprints of deleted files so that when I encounter such a
file again later, I know I can safely delete it.
3) keeping track of file integrity in an archival system (external drives,
CDs/DVDs, etc.), whether for detecting tampering or detecting
error/corruption. This purpose should be self-explanatory.
The collection must be easy to manipulate, query, filter, and search
(e.g., stored in a SQLite database so we can query it and do other
RDBMS-like things). A scripting interface (e.g. a Python API) is a must.
On top of this system I could then build interesting tools, such as file
history tracking (the state of a file at a given time in the past), ...
I would like to find and use an existing system if possible; I want to
avoid reinventing the wheel. Sorry if the question sounds vague to some; I
am brainstorming myself by asking it.