btrfs deduplication and rsync to other file system...

Hello,

I'm looking at the btrfs deduplication system, because I have *data* (not system) in archives with many duplicate texts. The total is now around 1 TB and growing, and I would like to keep it on 1 TB disks (I have several copies). It is mostly old text work or mail archives.

I tried fdupes or a similar deduplication application, but then I got a scattered archive. I still have the data, but files that belong together are now spread over various folders. I had to stop doing that.

If I understand correctly, btrfs deduplication works with links (the removed files are replaced by a symlink). duperemove: https://btrfs.readthedocs.io/en/latest/Deduplication.html
This needs a btrfs file system (or a similar high-end one; I will stay with btrfs).

Question: what happens if I rsync the data to an ext4 file system (on another disk)? Is deduplication lost or kept? Same question if I sync this to the cloud.

Why? Because I wonder whether my archives will be readable by another (Linux) system in the future; not everyone supports btrfs, while ext4 is everywhere.

By the way, I can't find up-to-date documentation about btrfs backup (the wiki is flagged "obsolete").

If I explained my problem correctly, one can see that I want to *archive data*. No incremental backup needed. I keep 4 TB of data in total (but the other 3 TB are photos, videos and ISO files, kept unique and already compressed, so no chance to gain space there).

And finally, obviously, if anyone has a better solution for what I'm trying to do, I'll take it :-)

thanks
jdd
--
https://artdagio.fr
Hello,

In the Message;
Subject : btrfs deduplication and rsync
Message-ID : <58db269b-392a-4f40-8eb9-10578cb94a49@dodin.org>
Date & Time: Fri, 3 Nov 2023 09:52:21 +0100

[jdd] == "jdd@dodin.org" <jdd@dodin.org> has written:

jdd> btrfs deduplication and rsync to other file system...
jdd> Hello,
jdd> I'm looking on btrfs deduplication system, because I have *data*
jdd> (not system) in archives with many duplicates texts. The total
jdd> is now around 1Tb large and growing and I would like to keep it
jdd> on 1Tb disks (I have several copies). Often old text work or
jdd> mail archives.

[...]

jdd> and in fine, obviously, if one have any better solution to what I try to do,
jdd> I'll take it :-)

As I mentioned before, I migrated from an ext4 file system to a btrfs file system. I read a lot of information on the net saying that a clean install was the only way to migrate, but after installing it on my laptop I concluded that the file tree on ext4 and on btrfs is structurally the same, so I migrated using a backup taken on the ext4 system. After a clean install of Tumbleweed with btrfs (with the same kernel as the backed-up system), I restored the backup and it is working fine.

I often install and test new software, and I usually restore the backup I took with rsync beforehand once the test is finished; this has never caused any problems.

I hope this helps.

Regards & A little early, but Good Night.

---
┏━━┓彡 野宮 賢 mail-to: nomiya @ lake.dti.ne.jp
┃\/彡
┗━━┛
"Maddox hopes that empowering users to pick their own algorithms will get them to think more about what's involved in making them."
-- Bluesky's Custom Algorithms Could Be the Future of Social Media
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
btrfs deduplication and rsync to other file system...
Hello,
I'm looking at the btrfs deduplication system, because I have *data* (not system) in archives with many duplicate texts. The total is now around 1 TB and growing, and I would like to keep it on 1 TB disks (I have several copies). It is mostly old text work or mail archives.
I tried fdupes or a similar deduplication application, but then I got a scattered archive. I still have the data, but files that belong together are now spread over various folders. I had to stop doing that.
If I understand correctly, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
files are replaced by a symlink). duperemove:
https://btrfs.readthedocs.io/en/latest/Deduplication.html
this needs a btrfs file system (or a similar high-end one; I will stay with btrfs)
question: what happens if I rsync the data to an ext4 file system (on another disk)? Is deduplication lost or kept?
btrfs deduplication cannot be kept on ext4 by definition, because ext4 does not support reflinks. And even if it did, rsync does not support them (it works in units of files, not extents), so deduplication would be lost even if you rsync from btrfs to btrfs.
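To illustrate the difference, here is a minimal sketch; the file names and mount points are made up, and cp --reflink=always only succeeds on filesystems with reflink support, such as btrfs or XFS:

```shell
# On btrfs: a reflink copy shares data extents with the original, so it
# takes almost no extra space until one of the copies is modified.
cp --reflink=always archive.tar archive-copy.tar

# rsync works file by file: on the ext4 side every file gets its own
# full copy of the data, so any extent sharing is lost in transit.
rsync -a /mnt/btrfs/archive/ /mnt/ext4/archive/
```

This is why the 1 TB of deduplicated data on btrfs can need more than 1 TB once rsynced to ext4.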
same question if I sync this to the cloud.
Ask your cloud provider what they do with data they receive. If you are concerned with the amount of data you upload and its cost - it is the question to the tool you use to "sync this to the cloud". Your tool may perform client side deduplication. But whatever happens, it is completely independent of and unaware of what btrfs does.
Why? Because I wonder whether my archives will be readable by another (Linux) system in the future; not everyone supports btrfs, while ext4 is everywhere.
By the way, I can't find up-to-date documentation about btrfs backup (the wiki is flagged "obsolete").
btrfs is just a filesystem which can be backed up as any other filesystem.
If I explained my problem correctly, one can see that I want to *archive data*. No incremental backup needed. I keep 4 TB of data in total (but the other 3 TB are photos, videos and ISO files, kept unique and already compressed, so no chance to gain space there).
And finally, obviously, if anyone has a better solution for what I'm trying to do, I'll take it :-)
thanks jdd -- https://artdagio.fr
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
btrfs is just a filesystem which can be backed up as any other filesystem.
Obviously not, if deduplication is applied, as you explained just before. But I know there are btrfs-to-btrfs ways to do it (a sort of copy); I just don't know the details.

thanks
jdd
--
https://artdagio.fr
On 03.11.2023 15:13, jdd@dodin.org wrote:
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
btrfs is just a filesystem which can be backed up as any other filesystem.
obviously not, if deduplication is applied, as you explained just before.
You have a rather strange idea about what backup is. Of course you can back up the content of a btrfs filesystem like any other filesystem. That has absolutely nothing to do with it being deduplicated or not. If you want to *also* preserve the space savings, that is a rather different topic, only tangentially related to "backup".
but I know there are btrfs to btrfs way to do it (sort of copy), but I don't know the details
There is "btrfs send/receive" which should preserve space savings (but does not guarantee it).
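A minimal sketch of that btrfs-to-btrfs copy, assuming /data is a subvolume on btrfs and /mnt/backup is a second btrfs disk (both paths are made up; the commands need root):

```shell
# btrfs send requires a read-only source, so snapshot first.
btrfs subvolume snapshot -r /data /data/.snap-2023-11-03

# Stream the snapshot to the other btrfs disk; receive recreates it
# as a subvolume and can carry over shared (reflinked) extents.
btrfs send /data/.snap-2023-11-03 | btrfs receive /mnt/backup
```

Because the target must also be btrfs, this preserves space savings in a way rsync cannot.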
On 03/11/2023 at 18:03, Andrei Borzenkov wrote:
On 03.11.2023 15:13, jdd@dodin.org wrote:
but I know there are btrfs to btrfs way to do it (sort of copy), but I don't know the details
There is "btrfs send/receive" which should preserve space savings (but does not guarantee it).
Yes, that's it, thanks.

jdd
--
https://artdagio.fr
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
If I understand well, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
My problem is this part: https://btrfs.readthedocs.io/en/latest/Deduplication.html#file-based-dedupli... where reflink is only part of the use. I may have to experiment.

The tool I need would replace duplicate files with symlinks (not necessarily on btrfs), so space is saved but files are still where they belong :-)

thanks
jdd
--
https://artdagio.fr
On 03.11.2023 15:23, jdd@dodin.org wrote:
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
If I understand well, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
my problem is this part
https://btrfs.readthedocs.io/en/latest/Deduplication.html#file-based-dedupli...
It describes how the data to be deduplicated is selected. btrfs does not have any built-in means to deduplicate anything (except as a side effect of snapshots), so deduplication works by selecting some data, analyzing it, and then replacing identical data with a pointer to the same data in some other place. You may choose to deduplicate only between some files or across the whole filesystem content.
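As a hedged example of that select-and-deduplicate step with duperemove (the path is made up; the target must be a btrfs filesystem):

```shell
# -r scans the tree recursively, -d actually submits the dedupe
# requests to the kernel (without it duperemove only reports), and
# the hashfile caches block checksums so repeated runs are faster.
duperemove -r -d --hashfile=/var/tmp/archive.hash /data/archive
```

Afterwards the files are unchanged from the user's point of view; only the underlying extents are shared.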
where reflink is only part of the use. I may have to experiment
the tool I need would replace duplicate files with symlinks (not necessarily on btrfs), so space is saved but files are still where they belong :-)
thanks jdd
On 2023-11-03 13:23, jdd@dodin.org wrote:
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
If I understand well, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
my problem is this part
https://btrfs.readthedocs.io/en/latest/Deduplication.html#file-based-dedupli...
where reflink is only part of the use. I may have to experiment
the tool I need would replace duplicate files with symlinks (not necessarily on btrfs), so space is saved but files are still where they belong :-)
"Traditional" filesystem-independent deduplication replaces identical files with hardlinks (not softlinks). These can be rsynced, IF you tell rsync to keep links:

OPTIONS="--archive --acls --xattrs --hard-links --sparse --stats --human-readable"

Windows filesystems do not support them. Maybe recent NTFS does. Some compressed archive formats can preserve them.

--
Cheers / Saludos,
Carlos E. R.
(from openSUSE 15.5 (Laicolasse))
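The hardlink approach described above can be sketched in plain shell. The paths below are made-up demo paths; real tools such as rdfind (with -makehardlinks true) automate the same scan over a whole tree:

```shell
#!/bin/sh
# Sketch of hardlink-based deduplication: the duplicate keeps its path,
# but both names point at the same inode, so the data is stored once.
set -e
mkdir -p /tmp/dedup_demo/a /tmp/dedup_demo/b
printf 'same content\n' > /tmp/dedup_demo/a/one.txt
printf 'same content\n' > /tmp/dedup_demo/b/two.txt

# Only link when the contents are byte-identical.
if cmp -s /tmp/dedup_demo/a/one.txt /tmp/dedup_demo/b/two.txt; then
    # Replace the duplicate with a hardlink to the first copy.
    ln -f /tmp/dedup_demo/a/one.txt /tmp/dedup_demo/b/two.txt
fi

# Both paths still exist, share one inode, and have link count 2.
stat -c '%h %i' /tmp/dedup_demo/b/two.txt
```

Note that rsync with --hard-links preserves this structure only among files transferred in the same run.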
On 03/11/2023 at 19:35, Carlos E. R. wrote:
On 2023-11-03 13:23, jdd@dodin.org wrote:
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
If I understand well, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
my problem is this part
https://btrfs.readthedocs.io/en/latest/Deduplication.html#file-based-dedupli...
where reflink is only part of the use. I may have to experiment
the tool I need would replace duplicate files with symlinks (not necessarily on btrfs), so space is saved but files are still where they belong :-)
"traditional" filesystem independent deduplication replaces identical files with hardlinks (not softlinks).
what application?? thanks jdd
These can be rsynced, IF you tell rsync to keep links.
OPTIONS="--archive --acls --xattrs --hard-links --sparse --stats --human-readable "
Windows filesystem do not support them. Maybe recent ntfs.
Some compressed archives can respect them.
On 2023-11-03 19:39, jdd@dodin.org wrote:
On 03/11/2023 at 19:35, Carlos E. R. wrote:
On 2023-11-03 13:23, jdd@dodin.org wrote:
On 03/11/2023 at 12:30, Andrei Borzenkov wrote:
On Fri, Nov 3, 2023 at 11:52 AM jdd@dodin.org <jdd@dodin.org> wrote:
If I understand well, btrfs deduplication works with links (the removed
You are mistaken. It works on data extent level and has nothing to do with links. You are probably confused by the "link" in "reflink".
my problem is this part
https://btrfs.readthedocs.io/en/latest/Deduplication.html#file-based-dedupli...
where reflink is only part of the use. I may have to experiment
the tool I need would replace duplicate files with symlinks (not necessarily on btrfs), so space is saved but files are still where they belong :-)
"traditional" filesystem independent deduplication replaces identical files with hardlinks (not softlinks).
what application??
I don't have one yet. This is the question I asked last week, and I'm still studying it. I have managed to create those links with rsync itself when the filenames match.

--
Cheers / Saludos,
Carlos E. R.
(from openSUSE 15.5 (Laicolasse))
On 03/11/2023 at 20:18, Carlos E. R. wrote:
I don't have one yet. This is the question I asked the last week and I'm still studying.
I follow the thread but don't understand it all :-(
I have managed to create those links with rsync itself when the filenames match.
Is that the command you wrote in your answer?

OPTIONS="--archive --acls --xattrs --hard-links --sparse --stats --human-readable"

As I understand it, that copies the hard links as they are.

I can't find rdfind for Tumbleweed, and fdupes -L doesn't look OK according to the man page.

Too late for today :-)

thanks
jdd
--
https://artdagio.fr
On 2023-11-03 22:59, jdd@dodin.org wrote:
On 03/11/2023 at 20:18, Carlos E. R. wrote:
I don't have one yet. This is the question I asked the last week and I'm still studying.
I follow the thread but don't understand it all :-(
I have managed to create those links with rsync itself when the filenames match.
is that the command you wrote in your answer?
No. It is complicated, I don't have it in my notes. Similar to:

rsync $OPTIONS --link-dest=$PREVIO --exclude=/lost+found \
  $QUE/ $DESTINO$QUE

I have my phone backup:

phone/date1/*
phone/date2/*

So I do:

OPTIONS="--archive --acls --xattrs --hard-links --sparse --stats --human-readable --checksum"

rsync $OPTIONS phone/date1/ NewPhoneBackup/date1
rsync $OPTIONS --link-dest=NewPhoneBackup/date1 \
  phone/date2/ NewPhoneBackup/date2

The new backup will have hardlinks created for identical files in phone/date1/* and phone/date2/*. Then I delete both phone/date1/* and phone/date2/*.
OPTIONS="--archive --acls --xattrs --hard-links --sparse --stats --human-readable "
These are the options I normally pass to rsync in my scripts.
as I understand it copies the hard links as they are.
I can't find rdfind for Tumbleweed, and fdupes -L doesn't look OK according to the man page.
too late for today :-)
thanks jdd
-- Cheers / Saludos, Carlos E. R. (from openSUSE 15.5 (Laicolasse))
On 03/11/2023 at 09:52, jdd@dodin.org wrote:
btrfs deduplication and rsync to other file system...
I did some tests on compression/deduplication on my data, after copying it to another disk. The results are not that good: on 683 GB of files, mostly text (but obviously not only, and probably not the main part), I got only a 50 GB gain from compression :-(

A night-long duperemove run freed another 20 GB. Not worth the work.

Of course, this is only for *my data*, not a general value.

thanks
jdd
--
https://artdagio.fr
Hello,

In the Message;
Subject : Re: btrfs deduplication and rsync [actual tests]
Message-ID : <8b87a846-b877-4796-92f5-53c76b52ecb8@dodin.org>
Date & Time: Sun, 5 Nov 2023 09:24:15 +0100

[JDD] == "jdd@dodin.org" <jdd@dodin.org> has written:

JDD> Le 03/11/2023 à 09:52, jdd@dodin.org a écrit :
JDD> > btrfs deduplication and rsync to other file system...
JDD> >
JDD> I did some tests on compression/deduplication on my data, after
JDD> copying it to an other disk.
JDD> The results are not that good
JDD> on a 683Gb file content with mostly text (but obviously not only
JDD> and probably not the main part), I got only 50g gain by
JDD> compression :-(

It is important to note that the effectiveness of compression and deduplication techniques depends on the type of data being compressed or deduplicated. Text files are already compressed and may not benefit much from further compression.

Regards.

---
┏━━┓彡 野宮 賢 mail-to: nomiya @ lake.dti.ne.jp
┃\/彡
┗━━┛
"Companies have come to view generative AI as a kind of monster that must be fed at all costs—even if it isn't always clear what exactly that data is needed for or what those future AI systems might end up doing."
-- Generative AI Is Making Companies Even More Thirsty for Your Data
On 05/11/2023 at 09:44, Masaru Nomiya wrote:
Text files are already compressed and may not benefit much from further compression.
Certainly not. mp4 or jpg files are already compressed, but raw txt or html files are not (LibreOffice text documents are probably compressed, though).

jdd
--
https://artdagio.fr
Hello,

In the Message;
Subject : Re: btrfs deduplication and rsync [actual tests]
Message-ID : <ebbf3b37-23d0-4979-8801-ecdb83459e92@dodin.org>
Date & Time: Sun, 5 Nov 2023 09:51:18 +0100

[jdd] == "jdd@dodin.org" <jdd@dodin.org> has written:

jdd> Le 05/11/2023 à 09:44, Masaru Nomiya a écrit :
MN> > Text files are already compressed and may not benefit much from
MN> > further compression.
jdd> certainly not. mp4 ot jpg files are already compressed, not raw
jdd> txt or html files (libreoffice text are probably compressed)

Sorry, I didn't phrase that well. Text files don't have much redundancy, and in your case there weren't many duplicate files. That is what I meant.

For another file system, there is this approach:
https://www.jeskell.com/blog/jeskell-overview-of-data-deduplication-compress...

Regards.

---
┏━━┓彡 野宮 賢 mail-to: nomiya @ lake.dti.ne.jp
┃\/彡
┗━━┛
"Maddox hopes that empowering users to pick their own algorithms will get them to think more about what's involved in making them."
-- Bluesky's Custom Algorithms Could Be the Future of Social Media
On 05/11/2023 at 10:22, Masaru Nomiya wrote:
Text files don't have much redundancy, and in your case, there weren't many duplicate files. I guess that's what I meant.
There are, but maybe not as large as I was thinking. I have a lot of html files in archives from a course I made some years ago, with many backups and few file changes between them. Lots of files. Also mail archives, but I guess on modern TB-size disks it's not that large; my Thunderbird folder was 12 GB recently, not much for 1 TB disks.

thanks
jdd
--
https://artdagio.fr
participants (4)
- Andrei Borzenkov
- Carlos E. R.
- jdd@dodin.org
- Masaru Nomiya