
Allow users to import data into a new Phacility instance
Open, Needs Triage, Public

Description

This probably has three major components:

  • We need a script that users can run which reasonably produces a single backup archive of an install. This should be smart enough to work across Phabricator versions, have a path forward for cluster installs, etc.
  • We need a reasonable way for users to upload enormous (multi-gigabyte) files.
  • Then we need to import the data. Some of the issues we're likely to encounter:
    • Most login providers aren't usable in the cluster, so we need a way to rebind accounts to Phacility accounts.
    • We have to wipe out some config options which could present security issues before we run the instance.
    • Run-of-the-mill stuff like references to internal repositories breaking.

Event Timeline

epriestley raised the priority of this task from to Needs Triage.
epriestley updated the task description.
epriestley added a project: Phacility.
epriestley moved this task to Backlog on the Phacility board.
epriestley added a subscriber: epriestley.

There's also a half-measure available here, where we at-least-mostly solve (1) and (2) and then do (3) manually. This is probably a good starting point anyway since we won't be able to catch all the issues with the import process in a vacuum.

I've been looking at (2) a bit -- letting instances upload large files. The ideas I came up with are:

Approach | Problems
scp | No way to resume uploads. Hard to secure.
rsync | Hard to secure. rsync ships with rrsync, which is a sketchy perl script that creeps me out.
Mount NFS Drive | Seems hugely complicated; poor experience with NFS; progress/resume iffy.
sftp | Hard to secure.
Raw HTTP Upload | Not resumable; not sure if we can do progress bars with APCu.
Uh, share a Dropbox folder? | Can't automate.
Some Other Third-Party Service | Don't know of any reasonable services.
Javascript HTTP Upload | Is Javascript.

(@btrahan / @chad, not sure if you have better ideas here.)

Using Javascript seems like the least-bad fix here. Roughly, it would go like this:

  • Add a storage engine which splits files into chunks (say, 8MB each).
  • This storage engine uses other storage engines to store the file data.
  • This allows us to stream downloads with a relatively small buffer (16MB-ish) in PHP.
  • Uploads use HTML5 File API to do client-side chunking and upload.
  • This will be some kind of new UI in /files/ for enormous file uploads, I guess.
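
To make the chunk mechanics concrete, here's a rough sketch of the chunk math and the read path (the class, chunk records, and engine API are made up for illustration, not the real storage engine interfaces):

<?php

// Rough sketch only: chunk records and the underlying engine API are
// hypothetical, not the real storage engine interfaces.

final class ChunkedEngineSketch {

  const CHUNK_SIZE = 8388608; // 8MB per chunk, as proposed above.

  /**
   * Compute the chunk list for a file of $total_length bytes.
   */
  public static function computeChunkList($total_length) {
    $chunks = array();
    for ($start = 0; $start < $total_length; $start += self::CHUNK_SIZE) {
      $chunks[] = array(
        'start' => $start,
        'length' => min(self::CHUNK_SIZE, $total_length - $start),
      );
    }
    return $chunks;
  }

  /**
   * Stream a chunked file by reading one chunk at a time from the
   * underlying engine, so the buffer stays around one chunk in size.
   * Each $chunks record is assumed to carry a 'dataKey' naming the
   * chunk's data in the underlying engine.
   */
  public static function streamChunks(array $chunks, $underlying_engine) {
    foreach ($chunks as $chunk) {
      echo $underlying_engine->readFile($chunk['dataKey']);
      flush();
    }
  }

}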

So the nice properties of this are:

  • We can support arbitrarily large files on non-streamable storage engines.
  • You don't need any special software to upload large files.
  • We can draw progress bars.
  • We can resume uploads.
  • Whole thing uses normal permissions.

Downsides are:

  • Hard to do an upload from a server (although you could use arc upload eventually).
  • You need to leave a browser window open.
  • Lots of JS.
  • We can't easily compute a SHA1 of the file contents (this is not critical).
  • If we run into integrity issues, we need to implement checksumming in JS. ;_;

But that seems like the least-bad of the options. Intended approach:

  • Look at, and possibly fix, T5843.
  • Add the chunked storage engine.
  • Probably make arc upload support it first, since that'll be easier to debug?
  • Once that works, write the JS bit.
  • Add support for HTTP headers to resume downloads.
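
For the last item, resuming downloads mostly comes down to honoring Range headers. Here's a minimal sketch of the parsing and the 206 response, just to pin down the mechanics (illustrative only, not the actual Files controller code):

<?php

// Sketch of HTTP range handling for resumable downloads; illustrative
// only, not the real controller logic.

function parse_range_header($header, $total_length) {
  // Accept headers like "bytes=1024-" or "bytes=1024-2047".
  $matches = null;
  if (!preg_match('/^bytes=(\d+)-(\d*)$/', $header, $matches)) {
    return null;
  }

  $start = (int)$matches[1];
  $end = strlen($matches[2]) ? (int)$matches[2] : ($total_length - 1);
  if ($start > $end || $end >= $total_length) {
    return null;
  }

  return array($start, $end);
}

$total_length = 123456789; // Hypothetical file length.
$range = parse_range_header('bytes=1024-', $total_length);

if ($range !== null) {
  list($start, $end) = $range;
  header('HTTP/1.1 206 Partial Content');
  header("Content-Range: bytes {$start}-{$end}/{$total_length}");
  header('Content-Length: '.($end - $start + 1));
  // ...then emit only the chunks overlapping [$start, $end].
}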

Sounds good to me. I did a little bit of poking around at some third-party services and none seemed to offer a good solution for the browser part, instead offering solutions for the back-end part (e.g., S3 has chunked file upload support).

The new protocol will go roughly like this, either via a new Conduit API method like file.allocate or an extension to file.uploadhash:

Client: I would like to upload a file with data hash H, metadata M, and length L. If possible, I'll resume an existing upload.

Then the server returns one of these responses:

Server: I don't know about data H. L is small enough to upload in one chunk. Go ahead and upload.
Server: I don't know about data H. L is OK, but is too large to upload in one chunk. I have created a new partial file F. Query its chunklist and upload chunks one at a time.
Server: I know some of data H. Resume upload of partial file F by querying chunks and then uploading missing chunks.
Server: I already know about data H. I created a new file F with your metadata.
Server: L is too large, you can not upload the file.
Server: Some other error message (no storage engines, write error, etc).
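
Written out as server-side logic, the decision looks roughly like this (a sketch only; the function, parameter names, and return values are invented for illustration, not the eventual file.allocate implementation):

<?php

// Sketch of the allocation decision described above, written as a pure
// function so the branches are explicit. Names are made up for
// illustration; this is not the real file.allocate implementation.

function decideAllocation(
  $length,          // L: declared file length in bytes.
  $max_size,        // Largest file the install will accept.
  $chunk_size,      // One-chunk upload threshold (8MB above).
  $has_complete,    // Server already has all of data H.
  $has_partial) {   // Server has a partial upload of data H.

  if ($length > $max_size) {
    return 'error: file is too large to upload';
  }

  if ($has_complete) {
    return 'done: create a new file record with your metadata';
  }

  if ($has_partial) {
    return 'resume: query chunks, then upload the missing ones';
  }

  if ($length <= $chunk_size) {
    return 'upload: send the whole file in one request';
  }

  return 'chunk: new partial file allocated; query and upload chunks';
}

// Example: a 40MB file the server has never seen before.
echo decideAllocation(40 * 1024 * 1024, 1024 * 1024 * 1024,
  8 * 1024 * 1024, false, false)."\n";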

Chunk querying happens through a new API like file.querychunks, and returns a list like this:

Chunk ID | hash | start | length | complete | filePHID
1 | abcdef | 0 | 1024 | 1 | PHID-FILE-blah
3 | null | 1024 | 789 | 0 | null

This would, e.g., tell the client that it needs to upload the chunk starting at byte 1024 with length 789, because that chunk is allocated but missing.

We also either need a new file.uploadchunk or some more parameters on file.upload -- probably the former.
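
Put together, a client (eventually arc upload) would drive the protocol roughly as follows. This is a sketch against the method names proposed above; the URL, token, and exact parameter and result keys (contentHash, byteStart, etc.) are assumptions, not a final API.

<?php

// Sketch of the client side of the chunked upload protocol. Method
// names follow the proposal above; parameter and result keys are
// assumptions, not a final API.

require_once 'scripts/__init_script__.php';

$path = '/tmp/enormous-backup.tar.gz'; // Hypothetical input file.
$length = filesize($path);

$conduit = new ConduitClient('https://instance.example.com/api/');
$conduit->setConduitToken('api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx');

// Ask the server to allocate (or resume) an upload for this content.
$allocate = $conduit->callMethodSynchronous(
  'file.allocate',
  array(
    'name' => basename($path),
    'contentHash' => sha1_file($path),
    'contentLength' => $length,
  ));

// (If the server says the file is small enough for a single request,
// the client would just use file.upload; that branch is omitted here.)
$file_phid = $allocate['filePHID'];

// Query the chunk list and upload every chunk which isn't complete yet;
// skipping complete chunks is what makes resume work.
$chunks = $conduit->callMethodSynchronous(
  'file.querychunks',
  array('filePHID' => $file_phid));

$handle = fopen($path, 'rb');
foreach ($chunks as $chunk) {
  if ($chunk['complete']) {
    continue;
  }
  fseek($handle, $chunk['start']);
  $data = fread($handle, $chunk['length']);

  $conduit->callMethodSynchronous(
    'file.uploadchunk',
    array(
      'filePHID' => $file_phid,
      'byteStart' => $chunk['start'],
      'dataEncoding' => 'base64',
      'data' => base64_encode($data),
    ));
}
fclose($handle);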

There's also a half-measure available here, where we at-least-mostly solve (1) and (2) and then do (3) manually. This is probably a good starting point anyway since we won't be able to catch all the issues with the import process in a vacuum.

Rough notes for eventually formalizing this:

  • I created a new, empty local instance.
  • I "renamespaced" the dump into the instance namespace, used bin/storage destroy to wipe the instance, then imported the dump.
  • I checked DB config for any locked keys (in this case, there were none -- I built this list manually from PhacilitySiteConfig):
SELECT * FROM config_entry WHERE configKey IN ('asana.project-ids',
'asana.workspace-id', 'celerity.minify', 'celerity.resource-hash',
'metamta.mail-adapter', 'phd.variant-config', 'phd.trace', 'phd.verbose',
'search.engine-selector', 'sms.default-adapter', 'sms.default-sender',
'storage.mysql-engine.max-size', 'syntax-highlighter.engine',
'files.enable-imagemagick', 'log.access.path', 'log.ssh.path',
'metamta.default-address', 'metamta.domain',
'metamta.single-reply-handler-prefix', 'notification.client-uri',
'notification.log', 'notification.pidfile', 'notification.server-uri',
'notification.ssl-cert', 'notification.ssl-key', 'phabricator.production-uri',
'phd.log-directory', 'phd.pid-directory', 'phd.start-taskmasters',
'storage.engine-selector', 'storage.s3.bucket', 'storage.upload-size-limit');
  • I manually destroyed all authentication providers on the instance, and all outstanding temporary tokens.
DELETE FROM auth_providerconfig;
DELETE FROM auth_providerconfigtransaction;
DELETE FROM auth_temporarytoken;
  • I manually destroyed all invites and external account links on the instance:
DELETE FROM user_authinvite;
DELETE FROM user_externalaccount;
  • I manually removed all account password hashes:
UPDATE user SET passwordHash = null, passwordSalt = null;
  • I synchronized the instance from the admin console.
    • This doesn't affect users / external accounts because it identifies that the accounts already exist on the instance.
    • This does correctly synchronize all the Almanac stuff.
  • I converted the repositories to be service-based, then verified they make service calls:
UPDATE repository SET almanacServicePHID = 'PHID-ASRV-ioaqpdwjefhjcywy5eyi';

I deleted all daemon logs and daemon records:

DELETE FROM daemon_log;
DELETE FROM daemon_logevent;

One thing I missed:

  • We need to change local-path in repositories to the correct instance location. This is sort of messy to do manually so I wrote a script for now:
<?php

// Rewrite each repository's "local-path" detail to point at this
// instance's repository directory under /core/repo/.

require_once 'scripts/__init_script__.php';

foreach (new LiskMigrationIterator(new PhabricatorRepository()) as $repo) {
  $details = $repo->getDetails();
  $old_path = $details['local-path'];
  $new_path = '/core/repo/'.
    PhabricatorEnv::getEnvConfig('cluster.instance').'/'.
    $repo->getCallsign().'/';

  $details['local-path'] = $new_path;

  // Write the updated details blob back with a raw UPDATE; this only
  // touches the serialized details column.
  queryfx(
    $repo->establishConnection('w'),
    'UPDATE %T SET details = %s WHERE id = %d',
    $repo->getTableName(),
    json_encode($details),
    $repo->getID());

  echo $old_path.' -> '.$new_path;
  echo "\n";
}

epriestley added a commit: Restricted Diffusion Commit. Jul 10 2015, 1:29 PM

I converted all the data destruction steps into a services wipe command. I'm planning to get the repo path stuff formalized too, then I'll update this with a simpler game plan.

The more modern, slightly more concise process is:

  • Suspend and silence the instance.
  • Trigger a manual backup.
  • Stop the daemons.
  • Grab the dump with bin/host download --phid <phid> --save-as whatever.sql.gz.
  • gunzip whatever.sql.gz
  • bin/storage renamespace --from phabricator --to whatever --in whatever.sql > whatever.renamespaced.sql
  • bin/storage destroy
  • mysql -uroot < whatever.renamespaced.sql
  • bin/storage upgrade -f
  • bin/services wipe
  • bin/repository move-paths --from /var/repo --to /core/repo/whatever
  • Sync instance.
  • Fix almanacServicePHID in repositories.
  • Remove all members on admin.
  • Import / unsilence.

This is still a lot of steps, but they're much less error-prone than they used to be.

Is there an import process available for customers of the Phacility free tier? We have a reasonably complex, self-hosted setup, and we're looking at Phacility but currently have no easy way of migrating the data.

You just need to tarball it up and upload it to us, and we'll do the import when you're ready.

I think we're going to end up trialling it with one project for now (just in case there's some prototype stuff we're using on the self-hosted instance that we run into on Phacility), and then probably manually move other things over if we decide to flip the switch (since we'll have done work on both at that point).

bin/storage renamespace --from phabricator --to whatever --in whatever.sql > whatever.renamespaced.sql

This is now:

  • bin/storage renamespace --from phabricator --to whatever --input whatever.sql --output whatever.renamespaced.sql

bin/host download --phid <phid> --save-as whatever.sql.gz

This is now:

bin/host download --file <id, phid, or monogram> --save-as whatever.sql.gz

See T13537 for a subtle issue where digestWithNamedKey() keys were cached in APCu on the web tier. Importing instance data may require restarting the web tier until the import process can either dump these caches or version them (versioning may be easier).

The digestWithNamedKey() issue above generally impacts anything using immutable caches, so it can affect CSRF too.
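
As a minimal illustration of the "version them" option: every cache key can embed a version, and bumping that version after an import effectively discards stale entries (like the cached digestWithNamedKey() material) without restarting the web tier. The key names and helpers below are invented for the sketch, not the actual Phabricator cache layer, and on a multi-host web tier the version would need to come from shared state (such as the database) rather than from APCu itself.

<?php

// Illustration of versioned cache keys in APCu (requires the APCu
// extension). Key and function names are made up for this sketch; this
// is not the Phabricator cache layer.

function versionedCacheKey($key) {
  // The version is itself stored in the cache; bumping it invalidates
  // every key derived from it.
  $version = apcu_fetch('cache.version');
  if ($version === false) {
    $version = 1;
    apcu_store('cache.version', $version);
  }
  return "v{$version}:{$key}";
}

function bumpCacheVersion() {
  $version = apcu_fetch('cache.version');
  apcu_store('cache.version', ($version === false) ? 1 : $version + 1);
}

// Usage: read and write through the versioned key.
apcu_store(versionedCacheKey('hmac.key.example'), 'secret-material');
bumpCacheVersion();
var_dump(apcu_fetch(versionedCacheKey('hmac.key.example'))); // bool(false)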