
Allow users to export their data from Phacility
Closed, Resolved · Public

Description

Users should have a reasonable way to export their data.

This is mostly problematic because the dumps may be very large, and we don't have very good infrastructure for handling multi-gigabyte files.

A few possible approaches:

  • Use a daemon to upload things to S3, then serve from S3 directly. I don't love this because it involves a custom policy system (S3) on the most sensitive data.
  • Install a custom file storage engine which just points to backups on disk. When users want to export data, we create a new Files entry on their instance which points at the backup data, and then we could let them download it with a custom SSH handler. This is relatively clean technically (well, we have to proxy, so maybe not that clean) but likely to end up being very bizarre. Too much of this feels weird and we should probably do the S3 thing.
  • Ideal is probably that we improve large file support in Phabricator, then upload the data to their instance (presumably pushing it through to S3, ultimately) and then let them download it. This would use standard auth pathways without being too weird, but we need to get large file support on both the upload and download pathways for it to work.

There's also a security concern: downloading the database backups will include session keys, CSRF tokens, any stored private keys, etc. We probably can't reasonably strip this data, although we could consider doing so in some cases. But "Export Data" needs to be tightly restricted.

Revisions and Commits

Restricted Differential Revision (×10)
rARC Arcanist: D16408, D14075, D14056
rP Phabricator: D16426, D14055

Event Timeline

Uploading the snapshot into the user's instance (which is being backed up) feels very meta; it will need to include past snapshots too...

Assuming the use case is "migrating away" and not "backup", I think it would be reasonable to:

  • Strip all user access credentials/tokens - IMO, security concerns win over convenience here.
  • Have the snapshot download URI only be alive for a short time (24 hours? nonce?)

But I'd like:

  • A (rough) estimate of the file size before I start,
  • Maybe the ability to destroy/exclude Files? There are items in there which users might not expect (thumbnails, images pasted into comments and never submitted, something from Diffusion?).

You can review available backups and their filesizes from the instance console:

Screen_Shot_2015-02-17_at_10.45.56_PM.png (screenshot of the instance console backup list, 61 KB)

This won't automatically include S3 data after T7163, though.

If users want to back up their data themselves, I probably wouldn't want to do the upload-to-instance thing, but I'm assuming (perhaps incorrectly) that most installs opting for SaaS aren't going to be super interested in running some local backup-backup program, and that this would be a rare/manual action. We communicate a lot of information about the existence and scheduling of backups, so it isn't a black box, either.

Now that we have large file support, a better approach is probably:

  • User clicks some "Export" button.
  • We use background tasks to arc upload backups from the repo and db hosts to admin (sketched below). After T5166, this can have a better UI.
  • The user downloads them from admin.

However, all the security stuff is still live.
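
A rough sketch of that upload step, for illustration only -- the host, backup path, and admin URI here are assumptions, not actual Phacility values:

# On a repo or db host: push the latest backup for the instance up to admin.
# /backup/turtle/db.sql.gz and the --conduit-uri value are illustrative assumptions.
$ arc upload /backup/turtle/db.sql.gz --conduit-uri https://admin.phacility.com/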

Manual runbook on this until we formalize the process, for hypothetical instance turtle:

  • Maybe drop caches with PHABRICATOR_INSTANCE=turtle bin/cache purge --purge-all to reduce the amount of data we're shipping around.
  • Dump the database with PHABRICATOR_INSTANCE=turtle bin/storage dump | gzip > turtle.sql.gz.
  • Move the dump to a secure location (there isn't a great command for this right now, but you can get there with a couple of hops of scp) and uncompress it.
  • Load it with mysql -uroot < turtle.sql or similar. You may need to bin/storage destroy --namespace turtle first if you're repeating this process.
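
Put together, the prep steps look roughly like this (host names and the bastion hop are illustrative; the commands themselves are the ones listed above):

# On the database host: drop caches and dump the instance database.
$ PHABRICATOR_INSTANCE=turtle bin/cache purge --purge-all
$ PHABRICATOR_INSTANCE=turtle bin/storage dump | gzip > turtle.sql.gz
# No direct host-to-host SSH, so hop the dump through the bastion to a secure working host.
$ scp turtle.sql.gz bastion001:
# ...then from bastion001, scp it on to the working host. There:
$ gunzip turtle.sql.gz
# If repeating the process, clear the old namespace first.
$ bin/storage destroy --namespace turtle
$ mysql -uroot < turtle.sql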

Then, adjust the data. First, truncate sensitive and Phacility-specific tables:

USE turtle_auth;
TRUNCATE auth_providerconfig;
TRUNCATE auth_providerconfigtransaction;
TRUNCATE auth_temporarytoken;

USE turtle_user;
TRUNCATE user_authinvite;
TRUNCATE user_externalaccount;

Now, truncate the Almanac tables.

IMPORTANT: Once Almanac comes out of prototype, user data could be stored here! Don't TRUNCATE these tables after it unprototypes.

USE turtle_almanac;
TRUNCATE almanac_binding;
TRUNCATE almanac_bindingtransaction;
TRUNCATE almanac_device;
TRUNCATE almanac_devicetransaction;
TRUNCATE almanac_interface;
TRUNCATE almanac_network;
TRUNCATE almanac_networktransaction;
TRUNCATE almanac_property;
TRUNCATE almanac_service;
TRUNCATE almanac_servicetransaction;
TRUNCATE edge;
TRUNCATE edgedata;

We can also get rid of the daemon information:

USE turtle_daemon;
TRUNCATE daemon_log;
TRUNCATE daemon_logevent;

Then drop repositories out of cluster/service mode:

USE turtle_repository;
UPDATE repository SET almanacServicePHID = null;

Probably good to also set localPath on any repositories to the standard location:

<?php

require_once 'scripts/__init_script__.php';

foreach (new LiskMigrationIterator(new PhabricatorRepository()) as $repo) {
  $details = $repo->getDetails();
  $old_path = $details['local-path'];
  $new_path = '/var/repo/'.$repo->getCallsign().'/';

  $details['local-path'] = $new_path;

  queryfx(
    $repo->establishConnection('w'),
    'UPDATE %T SET details = %s WHERE id = %d',
    $repo->getTableName(),
    json_encode($details),
    $repo->getID());

  echo $old_path.' -> '.$new_path;
  echo "\n";
}
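
To actually run that, one option is to drop it into the phabricator/ root as a standalone script (the filename here is just an example) so the relative require_once resolves, then invoke it with the instance environment set:

# Hypothetical filename; run from the phabricator/ checkout root.
$ PHABRICATOR_INSTANCE=turtle php fix_local_paths.php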

Finally:

  • Dump the database again (bin/storage dump > turtle.sql).
  • Use bin/storage renamespace --from turtle --to phabricator --in turtle.sql > turtle_phabricator.sql to switch it to the Phabricator namespace.
  • Compress it with gzip.
  • Deliver it to the customer.
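
As one consolidated sketch of those last steps:

$ PHABRICATOR_INSTANCE=turtle bin/storage dump > turtle.sql
$ bin/storage renamespace --from turtle --to phabricator --in turtle.sql > turtle_phabricator.sql
$ gzip turtle_phabricator.sql
# turtle_phabricator.sql.gz is the artifact that gets delivered to the customer.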

epriestley triaged this task as Normal priority. Sep 2 2015, 7:19 PM

I think the automated export flow probably looks something like this:

  1. User clicks "Export" next to a backup.
  2. We queue a daemon task to perform the export.
  3. This task connects to the host where the backup volume is mounted and copies the file to a host on a dedicated export tier.
  4. Then it connects to the export host and renamespaces, loads, drops caches, strips, dumps, normalizes, and compresses the backup, producing an export.
  5. It uploads the export to admin with the view policy set to the requesting user.

Step (3) is messy because hosts have no direct SSH access to one another (access always goes through the bastion host). It's probably cleaner to just ship the file up to admin and then download it onto the export host. This is effectively the same number of steps, reuses more infrastructure, and allows us to maintain a stronger host isolation model. (We should configure S3 on admin before pursuing this, though.)

Step (5) should definitely operate by just shipping the file up to admin.

Managing the transfer ourselves also potentially allows us to do progress bars and such, and building bin/host upload and bin/host download would clean up the mess of indirect scp invocations that large file transfers in the cluster currently involve.

epriestley added a revision: Restricted Differential Revision. Sep 4 2015, 8:41 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 4 2015, 9:07 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 7 2015, 3:48 PM

This is getting fairly close to working locally but isn't going to make it into this week's release. I want to let the changes in connection with T9307 settle first, and I'm still unsatisfied with the permission model (currently: anyone who can edit the instance can perform exports), because this capability represents a huge increase in power over any other capability.

This has been slightly streamlined; the modern flow is to dump first:

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/cache purge --purge-all
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage dump | gzip > turtle.sql.gz
$ /core/bin/host upload --file turtle.sql.gz

Now connect to a host in the aux tier to actually process the dump:

$ /core/bin/host download --phid <phid> --save-as turtle.sql.gz
$ gunzip turtle.sql.gz
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage destroy
$ mysql -uroot < turtle.sql
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/files migrate --engine blob --all
$ PHABRICATOR_INSTANCE=turtle /core/lib/services/bin/services export-wipe

Do a bin/repository list-paths and bin/repository move-paths -- this is a little irregular in some cases right now, so I didn't automate it.
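
For reference, that step looks something like this (the --from value is illustrative, and the move-paths flags are assumed to match stock Phabricator -- check bin/repository help move-paths first):

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/repository list-paths
# Move anything that isn't already under the standard location, e.g.:
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/repository move-paths --from /some/old/path/ --to /var/repo/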

Then export:

$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage dump > turtle-export.sql
$ PHABRICATOR_INSTANCE=turtle /core/lib/phabricator/bin/storage renamespace --from turtle --to phabricator --in turtle-export.sql > turtle-out.sql 
$ gzip turtle-out.sql
$ /core/bin/host upload --file turtle-out.sql.gz

Then go manually adjust file permissions on admin.

So this is still a bit manual, but moving toward being automated.

epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 2:05 PM
epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 3:16 PM
epriestley added a revision: Restricted Differential Revision. Aug 17 2016, 3:29 PM
epriestley added a commit: Restricted Diffusion Commit. Aug 17 2016, 8:19 PM

After D16427, I have an automated version of this working locally. It needs some more UI polish and I'm not going to try to get it in this week, but the support code (services dump / services export) will go out and then this can maybe go out next week or so.

For now, all users with billing account access will be able to do exports, but exports email all other members of the billing account when they are started and completed. This seems like a reasonable compromise until we build out more powerful/granular policy mechanisms in T9515 / T11256. Although I want to make sure we have a good story about resisting attacks from privileged users in the long run, I believe this isn't much of a concern for existing customers today.

epriestley added a commit: Restricted Diffusion Commit. Aug 20 2016, 10:57 PM
epriestley added a commit: Restricted Diffusion Commit.

I used the new support code in connection with a manual instance rename (see T11413) and it appears to have worked.

greenhatman updated the task description.
greenhatman updated the task description.
greenhatman added a subscriber: greenhatman.

Modern flow is:

  • Connect to the database shard.
  • bin/host dump --instance turtle
  • Copy the file PHID.
  • Connect to aux001.
  • bin/host export --instance turtle --database <phid>
  • Copy the file PHID.
  • Connect to admin001.
  • Manually set the file authorPHID to yourself, VERY VERY CAREFULLY.
  • In the web UI, set "Visible To" to you and the relevant user account.
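
As a single session sketch (paths are assumed to match the earlier /core/bin/host invocations; the PHIDs are whatever the commands print):

# On the database shard:
$ /core/bin/host dump --instance turtle
# Note the file PHID it prints, then on aux001:
$ /core/bin/host export --instance turtle --database <phid>
# Note the resulting file PHID, then finish up on admin001: set the file's authorPHID
# (very carefully) and set "Visible To" in the web UI.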

In PHI311, I think we hit a race like this:

  • (Observation) A ~128KB file with data stored in S3 was captured by the dump.
  • (Assumption) It was a temporary file, and after it was dumped, the GC deleted it.
  • (Observation) The export process tried to pull the data from S3, but S3 returned a 403.

(I'm not sure this is actually what happened, or why S3 returned a 403 instead of a 404.)

403 instead of a 404

I think this is expected:

If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.

  • If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
  • If you don't have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.

Via https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html#rest-object-head-permissions.
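
In other words, whether a missing key surfaces as 403 or 404 depends only on the caller's bucket-level permissions. For example, with the AWS CLI (bucket and key names here are made up):

# With s3:ListBucket on the bucket, a missing key comes back as 404 (Not Found):
$ aws s3api head-object --bucket example-phacility-bucket --key file/no-such-key
# Without s3:ListBucket, the same request comes back as 403 (Forbidden) instead.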


I think we hit a race like this:

See PHI769 for another case of this; it's reproducible this time. The file we're hitting an issue with is consistent with the theory above.

I'm going to just dodge this whole issue by skipping migration operations for expired files: even if the data is still there, there's no reason to try to migrate them.

I cherry-picked D19536, deployed it to aux001, and restarted the file data copy, which worked this time. I'm going to restart the export step and hopefully we're off to the races this time.

epriestley added a revision: Restricted Differential Revision. Jul 26 2018, 5:39 PM
epriestley added a revision: Restricted Differential Revision.
epriestley added a commit: Restricted Diffusion Commit. Jul 26 2018, 7:14 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Jul 30 2019, 6:18 PM
epriestley added a revision: Restricted Differential Revision. Jul 30 2019, 6:20 PM
epriestley added a commit: Restricted Diffusion Commit. Jul 30 2019, 6:21 PM
epriestley added a commit: Restricted Diffusion Commit. Jul 30 2019, 6:21 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley claimed this task.

See T13656 for followup.