
Allow users to import data into a new Phacility instance
Open, Needs Triage, Public

Description

This probably has three major components:

  • We need a script that users can run which reasonably produces a single backup archive of an install. This should be smart enough to work across Phabricator versions, have a path forward for cluster installs, etc.
  • We need a reasonable way for users to upload enormous (multi-gigabyte) files.
  • Then we need to import the data. Some of the issues we're likely to encounter:
    • Most login providers aren't usable in the cluster, so we need a way to rebind accounts to Phacility accounts.
    • We have to wipe out some config options which could present security issues before we run the instance.
    • Run-of-the-mill stuff like references to internal repositories breaking.

Event Timeline

epriestley raised the priority of this task from to Needs Triage.
epriestley updated the task description.
epriestley added a project: Phacility.
epriestley moved this task to Backlog on the Phacility board.
epriestley added a subscriber: epriestley.

There's also a half-measure available here, where we at-least-mostly solve (1) and (2) and then do (3) manually. This is probably a good starting point anyway since we won't be able to catch all the issues with the import process in a vacuum.

I've been looking at (2) a bit -- letting instances upload large files. The ideas I came up with are:

Approach | Problems
scp | No way to resume uploads. Hard to secure.
rsync | Hard to secure. rsync ships with rrsync, which is a sketchy perl script that creeps me out.
Mount NFS Drive | Seems hugely complicated; poor experience with NFS; progress/resume iffy.
sftp | Hard to secure.
Raw HTTP Upload | Not resumable; not sure if we can do progress bars with APCu.
Uh, share a Dropbox folder? | Can't automate.
Some Other Third-Party Service | Don't know of any reasonable services.
Javascript HTTP Upload | Is Javascript.

(@btrahan / @chad, not sure if you have better ideas here.)

Using Javascript seems like the least-bad fix here. Roughly, it would go like this:

  • Add a storage engine which splits files into chunks (say, 8MB each).
  • This storage engine uses other storage engines to store the file data.
  • This allows us to stream downloads with a relatively small buffer (16MB-ish) in PHP.
  • Uploads use HTML5 File API to do client-side chunking and upload.
  • This will be some kind of new UI in /files/ for enormous file uploads, I guess.
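
To make the chunk mechanics concrete, here's a rough sketch of the chunk math and the read path (the class, chunk records, and engine API are made up for illustration, not the real storage engine interfaces):

<?php

// Rough sketch only: chunk records and the underlying engine API are
// hypothetical, not the real storage engine interfaces.

final class ChunkedEngineSketch {

  const CHUNK_SIZE = 8388608; // 8MB per chunk, as proposed above.

  /**
   * Compute the chunk list for a file of $total_length bytes.
   */
  public static function computeChunkList($total_length) {
    $chunks = array();
    for ($start = 0; $start < $total_length; $start += self::CHUNK_SIZE) {
      $chunks[] = array(
        'start' => $start,
        'length' => min(self::CHUNK_SIZE, $total_length - $start),
      );
    }
    return $chunks;
  }

  /**
   * Stream a chunked file by reading one chunk at a time from the
   * underlying engine, so the buffer stays around one chunk in size.
   * Each $chunks record is assumed to carry a 'dataKey' naming the
   * chunk's data in the underlying engine.
   */
  public static function streamChunks(array $chunks, $underlying_engine) {
    foreach ($chunks as $chunk) {
      echo $underlying_engine->readFile($chunk['dataKey']);
      flush();
    }
  }

}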

So the nice properties of this are:

  • We can support arbitrarily large files on non-streamable storage engines.
  • You don't need any special software to upload large files.
  • We can draw progress bars.
  • We can resume uploads.
  • Whole thing uses normal permissions.

Downsides are:

  • Hard to do an upload from a server (although you could use arc upload eventually).
  • You need to leave a browser window open.
  • Lots of JS.
  • We can't easily compute a SHA1 of the file contents (this is not critical).
  • If we run into integrity issues, we need to implement checksumming in JS. ;_;

But that seems like the least-bad of the options. Intended approach:

  • Look at, and possibly fix, T5843.
  • Add the chunked storage engine.
  • Probably make arc upload support it first, since that'll be easier to debug?
  • Once that works, write the JS bit.
  • Add support for HTTP headers to resume downloads.
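
For the last item, resuming downloads mostly comes down to honoring Range headers. Here's a minimal sketch of the parsing and the 206 response, just to pin down the mechanics (illustrative only, not the actual Files controller code):

<?php

// Sketch of HTTP range handling for resumable downloads; illustrative
// only, not the real controller logic.

function parse_range_header($header, $total_length) {
  // Accept headers like "bytes=1024-" or "bytes=1024-2047".
  $matches = null;
  if (!preg_match('/^bytes=(\d+)-(\d*)$/', $header, $matches)) {
    return null;
  }

  $start = (int)$matches[1];
  $end = strlen($matches[2]) ? (int)$matches[2] : ($total_length - 1);
  if ($start > $end || $end >= $total_length) {
    return null;
  }

  return array($start, $end);
}

$total_length = 123456789; // Hypothetical file length.
$range = parse_range_header('bytes=1024-', $total_length);

if ($range !== null) {
  list($start, $end) = $range;
  header('HTTP/1.1 206 Partial Content');
  header("Content-Range: bytes {$start}-{$end}/{$total_length}");
  header('Content-Length: '.($end - $start + 1));
  // ...then emit only the chunks overlapping [$start, $end].
}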

Sounds good to me. I did a little bit of poking around at some third-party services and none seemed to offer a good solution for the browser part, instead offering solutions for the back-end part (e.g., S3 has chunked file upload support).

The new protocol will go roughly like this, either via a new Conduit API method like file.allocate or an extension to file.uploadhash:

Client: I would like to upload a file with data hash H, metadata M, and length L. If possible, I'll resume an existing upload.

Then the server returns one of these responses:

Server: I don't know about data H. L is small enough to upload in one chunk. Go ahead and upload.
Server: I don't know about data H. L is OK, but is too large to upload in one chunk. I have created a new partial file F. Query its chunklist and upload chunks one at a time.
Server: I know some of data H. Resume upload of partial file F by querying chunks and then uploading missing chunks.
Server: I already know about data H. I created a new file F with your metadata.
Server: L is too large, you can not upload the file.
Server: Some other error message (no storage engines, write error, etc).
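
Written out as server-side logic, the decision looks roughly like this (a sketch only; the function, parameter names, and return values are invented for illustration, not the eventual file.allocate implementation):

<?php

// Sketch of the allocation decision described above, written as a pure
// function so the branches are explicit. Names are made up for
// illustration; this is not the real file.allocate implementation.

function decideAllocation(
  $length,          // L: declared file length in bytes.
  $max_size,        // Largest file the install will accept.
  $chunk_size,      // One-chunk upload threshold (8MB above).
  $has_complete,    // Server already has all of data H.
  $has_partial) {   // Server has a partial upload of data H.

  if ($length > $max_size) {
    return 'error: file is too large to upload';
  }

  if ($has_complete) {
    return 'done: create a new file record with your metadata';
  }

  if ($has_partial) {
    return 'resume: query chunks, then upload the missing ones';
  }

  if ($length <= $chunk_size) {
    return 'upload: send the whole file in one request';
  }

  return 'chunk: new partial file allocated; query and upload chunks';
}

// Example: a 40MB file the server has never seen before.
echo decideAllocation(40 * 1024 * 1024, 1024 * 1024 * 1024,
  8 * 1024 * 1024, false, false)."\n";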

Chunk querying happens through a new API like file.querychunks, and returns a list like this:

Chunk ID | hash | start | length | complete | filePHID
1 | abcdef | 0 | 1024 | 1 | PHID-FILE-blah
3 | null | 1024 | 789 | 0 | null

This would, e.g., tell the client that it needs to upload the chunk starting at byte 1024 with length 789, because that chunk is allocated but missing.

We also either need a new file.uploadchunk or some more parameters on file.upload -- probably the former.
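
Put together, a client (eventually arc upload) would drive the protocol roughly as follows. This is a sketch against the method names proposed above; the URL, token, and exact parameter and result keys (contentHash, byteStart, etc.) are assumptions, not a final API.

<?php

// Sketch of the client side of the chunked upload protocol. Method
// names follow the proposal above; parameter and result keys are
// assumptions, not a final API.

require_once 'scripts/__init_script__.php';

$path = '/tmp/enormous-backup.tar.gz'; // Hypothetical input file.
$length = filesize($path);

$conduit = new ConduitClient('https://instance.example.com/api/');
$conduit->setConduitToken('api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx');

// Ask the server to allocate (or resume) an upload for this content.
$allocate = $conduit->callMethodSynchronous(
  'file.allocate',
  array(
    'name' => basename($path),
    'contentHash' => sha1_file($path),
    'contentLength' => $length,
  ));

// (If the server says the file is small enough for a single request,
// the client would just use file.upload; that branch is omitted here.)
$file_phid = $allocate['filePHID'];

// Query the chunk list and upload every chunk which isn't complete yet;
// skipping complete chunks is what makes resume work.
$chunks = $conduit->callMethodSynchronous(
  'file.querychunks',
  array('filePHID' => $file_phid));

$handle = fopen($path, 'rb');
foreach ($chunks as $chunk) {
  if ($chunk['complete']) {
    continue;
  }
  fseek($handle, $chunk['start']);
  $data = fread($handle, $chunk['length']);

  $conduit->callMethodSynchronous(
    'file.uploadchunk',
    array(
      'filePHID' => $file_phid,
      'byteStart' => $chunk['start'],
      'dataEncoding' => 'base64',
      'data' => base64_encode($data),
    ));
}
fclose($handle);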

There's also a half-measure available here, where we at-least-mostly solve (1) and (2) and then do (3) manually. This is probably a good starting point anyway since we won't be able to catch all the issues with the import process in a vacuum.

Rough notes for eventually formalizing this:

  • I created a new, empty local instance.
  • I "renamespaced" the dump into the instance namespace, used bin/storage destroy to wipe the instance, then imported the dump.
  • I checked DB config for any locked keys (in this case, there were none -- I built this list manually from PhacilitySiteConfig):
SELECT * FROM config_entry WHERE configKey IN ('asana.project-ids',
'asana.workspace-id', 'celerity.minify', 'celerity.resource-hash',
'metamta.mail-adapter', 'phd.variant-config', 'phd.trace', 'phd.verbose',
'search.engine-selector', 'sms.default-adapter', 'sms.default-sender',
'storage.mysql-engine.max-size', 'syntax-highlighter.engine',
'files.enable-imagemagick', 'log.access.path', 'log.ssh.path',
'metamta.default-address', 'metamta.domain',
'metamta.single-reply-handler-prefix', 'notification.client-uri',
'notification.log', 'notification.pidfile', 'notification.server-uri',
'notification.ssl-cert', 'notification.ssl-key', 'phabricator.production-uri',
'phd.log-directory', 'phd.pid-directory', 'phd.start-taskmasters',
'storage.engine-selector', 'storage.s3.bucket', 'storage.upload-size-limit');
  • I manually destroyed all authentication providers on the instance, and all outstanding temporary tokens.
DELETE FROM auth_providerconfig;
DELETE FROM auth_providerconfigtransaction;
DELETE FROM auth_temporarytoken;
  • I manually destroyed all invites and external account links on the instance:
DELETE FROM user_authinvite;
DELETE FROM user_externalaccount;
  • I manually removed all account password hashes:
UPDATE user SET passwordHash = null, passwordSalt = null;
  • I synchronized the instance from the admin console.
    • This doesn't affect users / external accounts because it identifies that the accounts already exist on the instance.
    • This does correctly synchronize all the Almanac stuff.
  • I converted the repositories to be service-based, then verified they make service calls:
UPDATE repository SET almanacServicePHID = 'PHID-ASRV-ioaqpdwjefhjcywy5eyi';

I deleted all daemon logs and daemon records:

DELETE FROM daemon_log;
DELETE FROM daemon_logevent;

One thing I missed:

  • We need to change local-path in repositories to the correct instance location. This is sort of messy to do manually so I wrote a script for now:
<?php

// Rewrite each repository's "local-path" detail to point at this
// instance's repository directory under /core/repo/.

require_once 'scripts/__init_script__.php';

foreach (new LiskMigrationIterator(new PhabricatorRepository()) as $repo) {
  $details = $repo->getDetails();
  $old_path = $details['local-path'];
  $new_path = '/core/repo/'.
    PhabricatorEnv::getEnvConfig('cluster.instance').'/'.
    $repo->getCallsign().'/';

  $details['local-path'] = $new_path;

  // Write the updated details blob back with a raw UPDATE; this only
  // touches the serialized details column.
  queryfx(
    $repo->establishConnection('w'),
    'UPDATE %T SET details = %s WHERE id = %d',
    $repo->getTableName(),
    json_encode($details),
    $repo->getID());

  echo $old_path.' -> '.$new_path;
  echo "\n";
}

epriestley added a commit: Restricted Diffusion Commit. Jul 10 2015, 1:29 PM

I converted all the data destruction steps into a services wipe command. I'm planning to get the repo path stuff formalized too, then I'll update this with a simpler game plan.

The more modern, slightly more concise process is:

  • Suspend and silence the instance.
  • Trigger a manual backup.
  • Stop the daemons.
  • Grab the dump with bin/host download --phid <phid> --save-as whatever.sql.gz.
  • gunzip whatever.sql.gz
  • bin/storage renamespace --from phabricator --to whatever --in whatever.sql > whatever.renamespaced.sql
  • bin/storage destroy
  • mysql -uroot < whatever.renamespaced.sql
  • bin/storage upgrade -f
  • bin/services wipe
  • bin/repository move-paths --from /var/repo --to /core/repo/whatever
  • Sync instance.
  • Fix almanacServicePHID in repositories.
  • Remove all members on admin.
  • Import / unsilence.

This is still a lot of steps, but they're much less error-prone than they used to be.

Is there an import process available for customers of the Phacility free tier? We have a reasonably complex, self-hosted setup, and we're looking at Phacility but currently have no easy way of migrating the data.

You just need to tarball it up and upload it to us, and we'll do the import when you're ready.

I think we're going to end up trialling it with one project for now (just in case there's some prototype stuff we're using on the self-hosted instance that we run into on Phacility), and then probably manually move other things over if we decide to flip the switch (since we'll have done work on both at that point).

bin/storage renamespace --from phabricator --to whatever --in whatever.sql > whatever.renamespaced.sql

This is now:

  • bin/storage renamespace --from phabricator --to whatever --input whatever.sql --output whatever.renamespaced.sql

bin/host download --phid <phid> --save-as whatever.sql.gz

This is now:

bin/host download --file <id, phid, or monogram> --save-as whatever.sql.gz

See T13537 for a subtle issue where digestWithNamedKey() keys were cached in APCu on the web tier. Importing instance data may require restarting the web tier until the import process can either dump these caches or version them (versioning may be easier).

The digestWithNamedKey() issue above generally impacts anything using immutable caches, so it can affect CSRF too.
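
As a minimal illustration of the "version them" option: every cache key can embed a version, and bumping that version after an import effectively discards stale entries (like the cached digestWithNamedKey() material) without restarting the web tier. The key names and helpers below are invented for the sketch, not the actual Phabricator cache layer, and on a multi-host web tier the version would need to come from shared state (such as the database) rather than from APCu itself.

<?php

// Illustration of versioned cache keys in APCu (requires the APCu
// extension). Key and function names are made up for this sketch; this
// is not the Phabricator cache layer.

function versionedCacheKey($key) {
  // The version is itself stored in the cache; bumping it invalidates
  // every key derived from it.
  $version = apcu_fetch('cache.version');
  if ($version === false) {
    $version = 1;
    apcu_store('cache.version', $version);
  }
  return "v{$version}:{$key}";
}

function bumpCacheVersion() {
  $version = apcu_fetch('cache.version');
  apcu_store('cache.version', ($version === false) ? 1 : $version + 1);
}

// Usage: read and write through the versioned key.
apcu_store(versionedCacheKey('hmac.key.example'), 'secret-material');
bumpCacheVersion();
var_dump(apcu_fetch(versionedCacheKey('hmac.key.example'))); // bool(false)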