Move object destruction to taskmaster daemons
Closed, WontfixPublic
Actions

Assigned To

Authored By

	joshuaspence
	Jun 8 2015, 2:45 AM

Description

As per T8433, object destruction with ./bin/remove destroy can be slow in some cases. Possibly this should be moved to the taskmaster daemons.

Related Objects

Mentioned Here: T8830: `bin/remove destroy` is dangerous and leaves dangling edges
T8433: Destroying a Diviner book is very slow

Event Timeline

joshuaspence created this task.Jun 8 2015, 2:45 AM

joshuaspence raised the priority of this task from to Needs Triage.

joshuaspence updated the task description. (Show Details)

joshuaspence added projects: Daemons, Infrastructure.

joshuaspence added a subscriber: joshuaspence.

Why do you need to destroy enough objects for this to matter?

I don't, but it maybe stops bad things from happening in cases where destruction takes a bad time? I was destroying a Diviner book and it took a long time so I ^Ced it and ran it again with --trace. I wasn't too worried in this case because Diviner books don't utilize a lot of infrastructure. But for other objects, this could be problematic maybe?

Alternatively, we could show a progress bar and that would probably be sufficient.

I think it's good that destruction happens in the foreground, since I think this should be a very rare operation and errors should always be reported to the user. Users who want to background long destruction operations for some reason can use tools like screen.

We can't print a progress bar because we don't know how many things need to be deleted. In the case of bin/remove destroy <huge book>, we're only deleting one "thing", but deleting that requires deleting very many other things. We could dynamically update the things-to-be-deleted count but the bar would be pretty meaningless and basically just be the same as printing "still working!" over and over again. I don't think this is very useful, since it's not any different from not printing "still working!" and continuing to work.

If you're concerned about hangs, you can generally use SIGHUP to dump a stack, strace -p <pid>, or ps auxwww to look for subprocesses, or mysql + show full processlist to make sure that an in-flight process is really making progress and not hung, instead of ^C'ing it and re-running with --trace.

I think we can't really prevent ^C from leaving things partially deleted since we can't transactionally delete all data in all cases (e.g., data in different storage engines, like the fulltext index or file storage engine).

We might be able to restructure engines so they generally attempt to clean up child/leaf data before parent/root data, which is maybe fine, but I worry slightly that it maybe worse to leave a root with missing leaves than leaves with a missing root: it's more likely that a root with missing leaves is still a "live object" which may have unexpected behavior without its leaves.

For example, imagine destroying an Almanac service and having it abort halfway through. If we delete root-first, we end up with some disconnected properties floating around in the database. Not ideal, but basically fine. If we delete leaves-first, we end up with a live node that is missing some properties it previously had! This could give the node a different meaning if those properties affected access control, service limits, etc. -- it's hypothetically bad if we delete a doNotUseThisService = true property and leave the service active.

Overall, no approach I can come up with here seems like it provides enough value to justify the additional complexity.

At the extreme, we could split destruction into two phases: in the first phase, we count the total number of operations we will need to perform (and perhaps estimate their cost). We then could optionally prompt the user ("this will destroy more than 29,320 objects and probably take about an hour, are you sure you want to continue?"), and be able to draw an accurate progress bar.

However, this would require making every DestructionExtension about twice as complex (to have an "estimate" phase and a "destroy" phase), and the estimation phase would also be very slow in some cases. We could add an estimate-phase-estimation-phase but this is silly and I think it all reduces to meaninglessness anyway.

See also T8830 for this command not being bin/time rewind.

Move object destruction to taskmaster daemonsClosed, WontfixPublicActions

Description

Related Objects

Event Timeline

Move object destruction to taskmaster daemons
Closed, WontfixPublic
Actions