Support user-specified expiration times #72
Loading…
Reference in a new issue
No description provided.
Delete branch ":expiration"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #33
This pull request adds support for an additional field during file uploads:
expires=<time>
. This field can be set to either:Neither option allows setting an expiration time longer than what would naturally be allotted to the file. In other words, this allows users to shorten the retention time of their files, but not extend it.
In order to accomadate this change, the following changes were necessary:
expiration
field to thefile
table in the database. Prior to this change, expiration dates were tracked using filesystem metadata. Unfortunately, I do not know a way to use filesystem metadata to store another date/duration to expire the file at, so it was necessary to replace this method of tracking expirations with the databasecleanup.py
script needed to be overhauled. Because it now needs database connectivity, I elected to move it into the main flask app as an alternate entrypoint, which can now be executed withFLASK_APP=fhost flask prune
. All files uploaded after the migration will have expiration times attached that the script will use to determine which files should be removed. However, files created prior to the migration will not have expirations associated. To address this, the script has an option (--legacy
) which supplements the new-style exiration checking with the old-style expiration checking, which is sufficient to clean up legacy files as they expire.cleanup.py
has been replaced with a deprecation notice informing users to switch to the new script, and to use the--legacy
flag.FHOST_MIN_EXPIRATION
andFHOST_MAX_EXPIRATION
, each of which accept a duration in milliseconds, and parameterize the amount of time files last before expiring. Previously, these values were hardcoded in the cleanup script.Please let me know what you think, and if there's any changes you think should be made!
Oh, nice work! Thank you!
But how about instead of recording the absolute expiration date, record both creation date + client-requested offset (if any) in the database? That allows changing the expiration settings (or the entire algorithm) at a later point, as well as more practical database queries. I actually want to stop relying on filesystem metadata.
Also, maybe populate the missing database fields during migration so we don’t need the
--legacy
flag?Support user-specified expiration timesto WIP: Support user-specified expiration timesThat makes sense!
I'll do my best to figure out populating database fields! I originally considered it, but got scared off because I wasn't entirely sure how to work with the database environment that alembic executes in (and without the sqlalchemy models), but I'll see if I can work it out!
Also the reupload case with existing files is kinda ugly and would allow bad actors to go around changing dates on random files. Some form of auth mechanism would be nice to have there, but I’d say that’s out of scope for this PR.
Not sure what to do in the meantime. How about this: If an expiration offset has been set already, do not modify the
uploaded_at
(or whatever you’d call it) timestamp or the offset, to enforce expiry at the time set by the original uploader. Only do that once the file has vanished.I'm not entirely sure I follow? Bad actors could theoretically extend expiration dates, but they wouldn't be able to shorten it, which is the same current behaviour anyway.
I guess if a user needs the file to be deleted after a certain point, that could be a problem, but I feel like 0x0 can't offer that guarantee of safety anyway, since files aren't even guaranteed to be deleted at their expirations, and can easily end up on archive.org or other sites anyway.
Hm fair enough. Might not be an issue then for this feature 🤷
Alright so I thought about this a little bit, and I think you were right that there's a dilemma with the reupload case. Specifically, I'm thinking of this case, where a re-upload comes shortly before a change in the expiration parameters:
I can think of four different behaviours the server could respond with:
Behavior 1
At T+00h, the file has a calculated expiration of T+24h (requested). When the file is re-uploaded, the file still has an expiration of T+24h, since the new requested expiration (T+18h) is earlier than the old requested expiration, however the new upload time and requested expiration are still stored. At T+14h, when the expiration parameters change, the server recognizes that no more time is owed to the first upload, and the later upload's requested expiration is used, so that the calculated expiration is T+18h.
Implementation aside, I think this is mathematically ideal behaviour for a system which allows retroactive application of expiration parameters. It ensures that files are expired exactly as soon as is permissible. However, in order to implement it, we'd need to store a full history of upload+requested expiration timestamps, as opposed to just the most recent, which would also require significant restructuring of the database, since this would violate existing uniqueness constraints.
Behavior 2
At T+00h, the file has a calculated expiration of T+24h. When the file is re-uploaded, the server observes that the requested expiration is before the existing expiration. The server updates the most recent upload date of file to T+12h, but does not change the expiration date from T+24h (or equivalently, changes the requested duration to 12h, preserving the T+24h expiration date). When the server changes the expiration parameters, the server observes that the natural expiration (now T+20h thanks to the updated upload time) is before the recorded requested expiration (T+24h). The file expires at T+20h.
I see this behavior as a compromise of behavior 1. It preserves the ability to retroactively apply expiration parameters and extend expiration dates without needing to track historic upload times at the cost of efficiency. Specifically, this file could be culled at T+18h, but the algorithm protects it until T+20h.
As far as database requirements, this requires storing a minimum of two timestamps - most recent upload and requested expiration
Behavior 3
When the file is uploaded, the uploader is guaranteed an expiration no sooner than T+24h. When the file is re-uploaded, the server observes that the requested expiration is sooner than the current expiration, and the existing expiration is preserved. This client is also guaranteed an expiration no sooner than T+24h. When the server changes expiration parameters, they are not retroactively applied to existing files. The file expires at T+24h.
This behavior sacrifices retroactive application of expiration parameters, and in exchange, is able to guarantee a file's minimum lifetime at upload time. This is the behavior as it is currently implemented in this PR. It requires storing a minimum of one timestamp - the latest guaranteed expiration. I'll also admit a bit of a bias on my part towards this solution, because the stable expiration time is something I like.
Behavior 4
When the file is uploaded, the calculated expiration is T+24h, and re-uploads are not permitted (maybe only until file expiration?). When the file is re-uploaded, the request is simply denied, either with a no-op
200 Okay
or a4XX
. When the expiration parameters are changed, they apply to the file, and it expires immediately at T+14h.This is my interpretation of your above comment, although I may have misinterpreted you? It requires storing two timestamps - original upload and original requested expiration. I'm personally not sure that this use case is something we should / are able to support for the reasons I described above, but I figured it was still worth mentioning as a possible behavior.
Discussion
One thing I think that's worth noting is that none of this behavior applies except in a bit of an edge case where a file is being reuploaded shortly before a change in the expiration parameters. Obviously this only affects a very very small number of users and servers, so it's not a very consequential decision, although it will considerably affect how this PR is developed.
The one exception to this is Behavior 4, which would impact expiration behavior outside of this edgecase.
Honestly, I wouldn't be writing such a long comment in discussion of this if it weren't for the fact that I'm currently trapped on a 12h train trip without WiFi or reception, and nothing to do but read PL papers and type excruciatingly long comments in .txt files. That said, I think I'm out of steam now anyway, so I'll wrap it up.
Like I said, my personal preference is behavior three, but I'm willing to implement any of these that you deem best for the project. Lemme know what you think, and thanks for tolerating my boredom-induced overthinking!
f507de8
adds support for calculating an expiration date for files during the migration. It's quite a bit jenky and I think it might violate some of the principles of alembic's migration system, but I couldn't really figure out a better way of doing it, although please let me know if you do.Still on the TODO:
I think behavior 3 is fine. Changing the expiration logic is more of an edge case after all. Thank you for putting this much thought into it!
I’ve taken a look at the migration script and added some review comments. Let me know what you think.
@ -367,0 +463,4 @@
File.expiration.is_not(None),
File.expiration < current_time
)
).all()
.all()
can be omitted here—unless you want to add a fancy progress bar withtqdm
?Noted! Will do! I think I'll leave out the fancy progress bar because I expect that this script (or at least the part of it that does more than add a new column to an empty database) will run a scarce few times, and there's really only a few existing servers where I expect it to take more than a couple seconds.
Oh wait I got mixed up, and thought this comment was on the migration script. Still, I doubt it'll take very long to run, even on larger servers.
11cfd07
@ -0,0 +38,4 @@
db = SQLAlchemy(current_app.__weakref__())
# Representations of the original and updated File tables
class File(db.Model):
It might be possible to use SQLAlchemy’s automap extension to reflect the schema as it exists. A minimal example I have not tested:
Ooooooo! I hadn't heard of that before! Thanks for bringing it up, I'll take a look!
Resolved in
d14713d
@ -0,0 +65,4 @@
# Calculate an expiration date for all existing files
files = File.query\
.where(
sa.not_(File.removed)
You can change this query to:
File.query.where(sa.not_(File.removed), File.sha256.in_(unexpired_files))
, also omitting the.all()
. This is much faster on my instance, which has over a million results for the unmodified query. That’s a lot of objects we don’t need!I deliberately chose not to do that here because I'm worried that the query will become too large, and I wasn't sure whether or not sqlalchemy has provisions built in to handle that case.
I've had trouble before with very large
in y
queries when working directly with the sqlite library, and I didn't want to chance it. If you think it's safe though, I'll make the change.I think that should be safe unless the SQL statement somehow grows larger than a gigabyte (SQLite’s default limit), which seems unlikely.
Resolved in
19d989b
@ -0,0 +73,4 @@
stat = os.stat(file_path)
max_age = get_max_lifespan(stat.st_size) # How long the file is allowed to live, in ms
file_birth = stat.st_mtime * 1000 # When the file was created, in ms
op.execute(
Instead of doing this here, it’s probably faster to store the modified fields:
and execute a bulk update at the end:
Makes sense! I'll implement it when I get out of classes later today.
Resolved in
55ee374
touch()
39d24e56c3How does that look?
@ -0,0 +40,4 @@
db = SQLAlchemy(current_app.__weakref__())
# Representation of the updated (future) File table
UpdatedFile = sa.table('file',
Is this still needed?
Nope! Missed that!
60db793
Starting to look good! Think I’ll test this in a bit.
@ -0,0 +31,4 @@
Value returned is a duration in milliseconds.
"""
def get_max_lifespan(filesize: int) -> int:
Note to self: Move utility functions like these to separate files…
WIP: Support user-specified expiration timesto Support user-specified expiration timesAlright, did a quick test run and it seems to be working as expected. Squashed and merged as
af4b3b06c0
. Thank you so much!