
Allow users to export their data from Phacility
Closed, Resolved · Public

Description

Users should have a reasonable way to export their data.

This is mostly problematic because the dumps may be very large, and we don't have very good infrastructure for handling multi-gigabyte files.

A few possible approaches:

  • Use a daemon to upload things to S3, then serve from S3 directly. I don't love this because it involves a custom policy system (S3) on the most sensitive data.
  • Install a custom file storage engine which just points to backups on disk. When users want to export data, we create a new Files entry on their instance which points at the backup data, and then we could let them download it with a custom SSH handler. This is relatively clean technically (well, we have to proxy, so maybe not that clean) but likely to end up being very bizarre. Too much of this feels weird and we should probably do the S3 thing.
  • Ideal is probably that we improve large file support in Phabricator, then upload the data to their instance (presumably pushing it through to S3, ultimately) and then let them download it. This would use standard auth pathways without being too weird, but we need to get large file support on both the upload and download pathways for it to work.

There's also a security concern: downloading the database backups will include session keys, CSRF tokens, any stored private keys, etc. We probably can't reasonably strip this data, although we could consider doing so in some cases. But "Export Data" needs to be tightly restricted.

Revisions and Commits

Restricted Differential Revision (×10)
rARC Arcanist: D16408, D14075, D14056
rP Phabricator: D16426, D14055

Event Timeline

Uploading the snapshot into the user's instance (which is being backed up) feels very meta; it will need to include past snapshots too...

Assuming the use case is "migrating away" and not "backup", I think it would be reasonable to:

  • Strip all user access credentials/tokens - IMO, security concerns win over convenience here.
  • Have the snapshot download URI only be alive for a short time (24 hours? nonce?)

But I'd like:

  • A (rough) estimate of the file size before I start,
  • Maybe the ability to destroy/exclude Files? There are items in there which users might not expect (thumbnails, images pasted into comments and never submitted, something from Diffusion?).

You can review available backups and their filesizes from the instance console:

Screen_Shot_2015-02-17_at_10.45.56_PM.png (screenshot of the instance console backup list, 61 KB)

This won't automatically include S3 data after T7163, though.

If users want to back up their data themselves, I probably wouldn't want to do the upload-to-instance thing, but I'm assuming (perhaps incorrectly) that most installs opting for SaaS aren't going to be super interested in running some local backup-backup program, and that this would be a rare/manual action. We communicate a lot of information about the existence and scheduling of backups, so it isn't a black box, either.

Now that we have large file support, a better approach is probably:

  • User clicks some "Export" button.
  • We use background tasks to arc upload backups from the repo and db hosts to admin (sketched below). After T5166, this can have a better UI.
  • The user downloads them from admin.

However, all the security stuff is still live.
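
A rough sketch of that upload step, for illustration only -- the host, backup path, and admin URI here are assumptions, not actual Phacility values:

# On a repo or db host: push the latest backup for the instance up to admin.
# /backup/turtle/db.sql.gz and the --conduit-uri value are illustrative assumptions.
$ arc upload /backup/turtle/db.sql.gz --conduit-uri https://admin.phacility.com/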

Manual runbook on this until we formalize the process, for hypothetical instance turtle:

  • Maybe drop caches with PHABRICATOR_INSTANCE=turtle bin/cache purge --purge-all to reduce the amount of data we're shipping around.
  • Dump the database with PHABRICATOR_INSTANCE=turtle bin/storage dump | gzip > turtle.sql.gz.
  • Move the dump to a secure location (there isn't a great command for this right now, but you can get there with a couple of hops of scp) and uncompress it.
  • Load it with mysql -uroot < turtle.sql or similar. You may need to bin/storage destroy --namespace turtle first if you're repeating this process.
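
Put together, the prep steps look roughly like this (host names and the bastion hop are illustrative; the commands themselves are the ones listed above):

# On the database host: drop caches and dump the instance database.
$ PHABRICATOR_INSTANCE=turtle bin/cache purge --purge-all
$ PHABRICATOR_INSTANCE=turtle bin/storage dump | gzip > turtle.sql.gz
# No direct host-to-host SSH, so hop the dump through the bastion to a secure working host.
$ scp turtle.sql.gz bastion001:
# ...then from bastion001, scp it on to the working host. There:
$ gunzip turtle.sql.gz
# If repeating the process, clear the old namespace first.
$ bin/storage destroy --namespace turtle
$ mysql -uroot < turtle.sql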

Then, adjust the data. First, truncate sensitive and Phacility-specific tables:

USE turtle_auth;
TRUNCATE auth_providerconfig;
TRUNCATE auth_providerconfigtransaction;
TRUNCATE auth_temporarytoken;

USE turtle_user;
TRUNCATE user_authinvite;
TRUNCATE user_externalaccount;

Now, truncate the Almanac tables.

IMPORTANT: Once Almanac comes out of prototype, user data could be stored here! Don't TRUNCATE these tables after it unprototypes.

USE turtle_almanac;
TRUNCATE almanac_binding;
TRUNCATE almanac_bindingtransaction;
TRUNCATE almanac_device;
TRUNCATE almanac_devicetransaction;
TRUNCATE almanac_interface;
TRUNCATE almanac_network;
TRUNCATE almanac_networktransaction;
TRUNCATE almanac_property;
TRUNCATE almanac_service;
TRUNCATE almanac_servicetransaction;
TRUNCATE edge;
TRUNCATE edgedata;

We can also get rid of the daemon information:

USE turtle_daemon;
TRUNCATE daemon_log;
TRUNCATE daemon_logevent;

Then drop repositories out of cluster/service mode:

USE turtle_repository;
UPDATE repository SET almanacServicePHID = null;

Probably good to also set localPath on any repositories to the standard location:

<?php

require_once 'scripts/__init_script__.php';

foreach (new LiskMigrationIterator(new PhabricatorRepository()) as $repo) {
  $details = $repo->getDetails();
  $old_path = $details['local-path'];
  $new_path = '/var/repo/'.$repo->getCallsign().'/';

  $details['local-path'] = $new_path;

  queryfx(
    $repo->establishConnection('w'),
    'UPDATE %T SET details = %s WHERE id = %d',
    $repo->getTableName(),
    json_encode($details),
    $repo->getID());

  echo $old_path.' -> '.$new_path;
  echo "\n";
}
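
To actually run that, one option is to drop it into the phabricator/ root as a standalone script (the filename here is just an example) so the relative require_once resolves, then invoke it with the instance environment set:

# Hypothetical filename; run from the phabricator/ checkout root.
$ PHABRICATOR_INSTANCE=turtle php fix_local_paths.php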

Finally:

  • Dump the database again (bin/storage dump > turtle.sql).
  • Use bin/storage renamespace --from turtle --to phabricator --in turtle.sql > turtle_phabricator.sql to switch it to the Phabricator namespace.
  • Compress it with gzip.
  • Deliver it to the customer.
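
As one consolidated sketch of those last steps:

$ PHABRICATOR_INSTANCE=turtle bin/storage dump > turtle.sql
$ bin/storage renamespace --from turtle --to phabricator --in turtle.sql > turtle_phabricator.sql
$ gzip turtle_phabricator.sql
# turtle_phabricator.sql.gz is the artifact that gets delivered to the customer.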

epriestley triaged this task as Normal priority. Sep 2 2015, 7:19 PM

I think the automated export flow probably looks something like this:

  1. User clicks "Export" next to a backup.
  2. We queue a daemon task to perform the export.
  3. This task connects to the host where the backup volume is mounted and copies the file to a host on a dedicated export tier.
  4. Then it connects to the export host and renamespaces, loads, drops caches, strips, dumps, normalizes, and compresses the backup, producing an export.
  5. It uploads the export to admin with the view policy set to the requesting user.

Step (3) is messy because hosts have no direct SSH access to one another (access always goes through the bastion host). It's probably cleaner to just ship the file up to admin and then download it onto the export host. This is effectively the same number of steps, reuses more infrastructure, and allows us to maintain a stronger host isolation model. (We should configure S3 on admin before pursuing this, though.)

Step (5) should definitely operate by just shipping the file up to admin.

Managing the transfer ourselves also potentially allows us to do progress bars and such, and building bin/host upload and bin/host download would clean up the mess of indirect scp invocations that large file transfers in the cluster currently involve.

epriestley added a revision: Restricted Differential Revision. Sep 4 2015, 8:41 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 4 2015, 9:07 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 7 2015, 3:48 PM

This is getting fairly close to working locally but isn't going to make it into this week's release. I want to let the changes in connection with T9307 settle first, and I'm still unsatisfied with the permission model (currently: anyone who can edit the instance can perform exports), because this capability represents a huge increase in power over any other capability.

This has been slightly streamlined; the modern flow is to dump first:

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/cache purge --purge-all
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage dump | gzip > turtle.sql.gz
$ /core/bin/host upload --file turtle.sql.gz

Now connect to a host in the aux tier to actually process the dump:

$ /core/bin/host download --phid <phid> --save-as turtle.sql.gz
$ gunzip turtle.sql.gz
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage destroy
$ mysql -uroot < turtle.sql
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/files migrate --engine blob --all
$ PHABRICATOR_INSTANCE=turtle /core/lib/services/bin/services export-wipe

Do a bin/repository list-paths and bin/repository move-paths -- this is a little irregular in some cases right now, so I didn't automate it.
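
For reference, that step looks something like this (the --from value is illustrative, and the move-paths flags are assumed to match stock Phabricator -- check bin/repository help move-paths first):

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/repository list-paths
# Move anything that isn't already under the standard location, e.g.:
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/repository move-paths --from /some/old/path/ --to /var/repo/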

Then export:

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage dump > turtle-export.sql
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage renamespace --from turtle --to phabricator --in turtle-export.sql > turtle-out.sql 
$ gzip turtle-out.sql
$ /core/bin/host upload --file turtle-out.sql.gz

Then go manually adjust file permissions on admin.

So this is still a bit manual, but moving toward being automated.

epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 2:05 PM
epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 3:16 PM
epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 3:29 PM
epriestley added a commit: Restricted Diffusion Commit. Aug 17 2016, 8:19 PM

After D16427, I have an automated version of this working locally. It needs some more UI polish and I'm not going to try to get it in this week, but the support code (services dump / services export) will go out and then this can maybe go out next week or so.

For now, all users with billing account access will be able to do exports, but exports email all other members of the billing account when they are started and completed. This seems like a reasonable compromise until we build out more powerful/granular policy mechanisms in T9515 / T11256. Although I want to make sure we have a good story about resisting attacks from privileged users in the long run, I believe this isn't much of a concern for existing customers today.

epriestley added a commit: Restricted Diffusion Commit. Aug 20 2016, 10:57 PM
epriestley added a commit: Restricted Diffusion Commit.

I used the new support code in connection with a manual instance rename (see T11413) and it appears to have worked.

greenhatman updated the task description.
greenhatman updated the task description.
greenhatman added a subscriber: greenhatman.

Modern flow is:

  • Connect to the database shard.
  • bin/host dump --instance turtle
  • Copy the file PHID.
  • Connect to aux001.
  • bin/host export --instance turtle --database <phid>
  • Copy the file PHID.
  • Connect to admin001.
  • Manually set the file authorPHID to yourself, VERY VERY CAREFULLY.
  • In the web UI, set "Visible To" to you and the relevant user account.
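
As a single session sketch (paths are assumed to match the earlier /core/bin/host invocations; the PHIDs are whatever the commands print):

# On the database shard:
$ /core/bin/host dump --instance turtle
# Note the file PHID it prints, then on aux001:
$ /core/bin/host export --instance turtle --database <phid>
# Note the resulting file PHID, then finish up on admin001: set the file's authorPHID
# (very carefully) and set "Visible To" in the web UI.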

In PHI311, I think we hit a race like this:

  • (Observation) A ~128KB file with data stored in S3 was captured by the dump.
  • (Assumption) It was a temporary file, and after it was dumped, the GC deleted it.
  • (Observation) The export process tried to pull the data from S3, but S3 returned a 403.

(I'm not sure this is actually what happened, or why S3 returned a 403 instead of a 404.)

403 instead of a 404

I think this is expected:

If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.

  • If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
  • If you don't have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.

Via https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html#rest-object-head-permissions.
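
In other words, whether a missing key surfaces as 403 or 404 depends only on the caller's bucket-level permissions. For example, with the AWS CLI (bucket and key names here are made up):

# With s3:ListBucket on the bucket, a missing key comes back as 404 (Not Found):
$ aws s3api head-object --bucket example-phacility-bucket --key file/no-such-key
# Without s3:ListBucket, the same request comes back as 403 (Forbidden) instead.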


I think we hit a race like this:

See PHI769 for another case of this; it's reproducible this time. The file we're hitting an issue with is consistent with the theory above.

I'm going to just dodge this whole issue by skipping migration operations for expired files: even if the data is still there, there's no reason to try to migrate them.

I cherry-picked D19536, deployed it to aux001, and restarted the file data copy, which worked this time. I'm going to restart the export step and hopefully we're off to the races this time.

epriestley added a revision: Restricted Differential Revision. Jul 26 2018, 5:39 PM
epriestley added a revision: Restricted Differential Revision.
epriestley added a commit: Restricted Diffusion Commit. Jul 26 2018, 7:14 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Jul 30 2019, 6:18 PM
epriestley added a revision: Restricted Differential Revision. Jul 30 2019, 6:20 PM
epriestley added a commit: Restricted Diffusion Commit. Jul 30 2019, 6:21 PM
epriestley added a commit: Restricted Diffusion Commit. Jul 30 2019, 6:21 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley claimed this task.

See T13656 for followup.