-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implements the superset disassembler #944
base: master
Are you sure you want to change the base?
implements the superset disassembler #944
Conversation
Cool, glad to see this PR. Great work! But we have lots of things to do. First of all, let's set up our end goals, the final requirements, so that we can always trace our decisions back to them. UPDATE: moved the specification to the top post. So, godspeed! And don't hesitate to ask questions and further directions, we will happy to help and assist you. Now, when we have this PR as good vantage point it should be easy. |
Thanks for your review. Yes I absolutely agree with those requirements and the code quality rules. Ok, if you plan to move all the disassemblers out, then perhaps it mostly belongs under plugins? I can make a library for it. I can certainly do the command line updates to Bap, and the documentation requests. However, the API changes require much more in depth thought, so I want to ask about those. There are certainly many improvements that need to be made. In so much as the API goes, I would like to ask if the internals of the graph implementation should be kept, because I had to have the ability to make the graph an reverse instruction sequence graph, where per-instruction flows could be recorded. When I started it, I don't think that graph interface that Bap now provides was available, and it was hard to refactor then. In addition, I think that the features implemented should be provided as a kind of plugin or registration capability so that others see that they can provide new heuristics to the fixpoint convergence. For the most part, there isn't so much usage of design where current features are kind of implemented directly on the naked representation, so I think this represents the first refactoring that needs to be done. Thoughts? In so much as the library, plugin, and tests go, I can make the interface look ok pretty fast, and I have a binary with a cmdliner that provides the ability to call it up. I do already have some tests, but they need to be refactored after the design and interface are done and they need to provide better coverage. But for the plugin capability of superset_disasm, without additional ability to intervene in the construction process of bap while it's lifting a binary, I don't see how I can provide a plugin to meaningful plugin because there is not a way yet to provide a pass that runs at disassembly time. I asked about that once on the channel, and you advised that I should provide my own instance of the bap process chain, but I do know that that will get refactored and provided to users, so that if I implement it, I will likely need to later remove such a chain. Among further or additional work that may be done to provide substantial improvements, I have a number of suggestions regarding the workings of the disassembler that represent different improvements to various things, including at least performance and code quality. Right now, the capability as implemented functions at a sufficient level when used, but there is room for it to become much sharper. Should I put those forward for review? |
Yep. Can't really predict how much code will end up in the plugins and how much will end up in the lib folder, but definitely lots of code doesn't belong to library here.
You can always expose some of the interface as experimental and some as stable. You can also play with library vs plugin tradeoff, where libraries stored in the plugin are private, vs libraries in the Having it extensible is really nice. So yes, we shall add some injection points, where users can inject their functions.
No problem, we will add an injection point in BAP, so that a user can specify its own disassembling algorithm. We didn't have one, just because we had only one option :) Now we have the superset disassembler, VSA, and other options, so it is time to create such point.
I think we shall focus on achieving our requirements and keep the room for further improvements. Ideal course of actions, is to make it work with minimal API exposed, then gradually improve and expose more and more functionality via the library interface. |
besides, do not hesitate to edit the top post and update it with new information. Treat it as a project wiki. |
Ok, all of that sounds good. "top pos"? |
the top post |
Should it be updated on top of BAP 2.0 PR then? #960 |
I will update it to use BAP 2.0 |
OK, now when BAP 2.0 has became a reality, we should rethink this PR from scratch. It basically provides some facilities that we no longer need (we now have a speculative disassembler by default) and some facilities that we need, and which are much easier to integrate now (the probabilistic part of the disassembler). In BAP 2.0 we have a knowledge base, which is basically a global storage of information which could be used by different components for information exchange. The new disassembler that drives program reconstruction in modern BAP is by default speculative. It will consult with the knowledge base, when it will decide whether it needs to speculate or not. In particular, it will consult with the So the only thing that we need from the probabilistic disassembler is to mark what it thinks to be code with Also, the new disassembler is backtracking and it will automatically classify as data all instruction sequences which end up in data (i.e., if an execution from a certain point is guaranteed to hit an invalid instruction or otherwise non-executable code, then all instructions that constitute this chain will be marked as data). I hope this makes sense, for starters, so if you're interested (or anyone else is interested) in updating this PR and bringing the probabilistic disassembler to BAP, then feel free to ask questions and requests for explanations :) |
I'm still very interested in finishing this out. I've been waiting until bap 2.0 lands to continue the plugin development, because I had started on it. I'm also super busy with other tasks, so it may be a while. On another note, I do have some considerations that I'll have to carefully examine. Having the speculative disassembler that BAP now provides be backtracking in particular kind of makes me wonder exactly what your specification of what backtracking is. You wrote "backtracking and it will automatically classify as data all instruction sequences which end up in data" - are those two different features or part of the same feature? If different, and if by eliminating sequences of instructions that fall through to a well-formed assembler violation by traveling from each instruction to its ancestor set, then that was the mechanism which my disassembler had implemented itself. If you mean backtracking in another sense, then I'll need to be able to control it with some kind of lambda function or something. I haven't seen the new API, so I'm not certain yet. Also, on the speculation part, I would like to know if I first should do all the processing on my side before providing the set of labels through the knowledge api to the disassembler, or if my api is something that will get resolved by the BAP mechanism, and your is-valid can be run iteratively. Because, I have found that having a fixpoint for disassembly confidence convergence is one mechanism, but there are also some good results for iterating on the disassembly itself, because you can obtain increasingly better results. |
It may take me a long time, but I won't give up on merging this. |
I guess the milestone should be moved then? |
Yes, you could update that. Sorry, I've been super busy working pursuing a degree and research at the same time. |
Ok, upon rethinking this, one of the first things I've encountered is that because I cannot retrieve the original image from the project that I have to load the file by retrieving the name from somewhere in the pass. Which will get garbage collected away, so not that big of a deal, and I will then write the results bap knowledge as a retrieval from the superset. |
If you mean code section, you can get it, use
That is not a good idea. First of all, it is not guaranteed that there will be a file at all. Second, it will not work with the existing file loaders. Third, it is just suboptimal. |
This PR implements the superset disassembler as described in the following paper. The top post acts as a working documentation. We will update it as the work and discussions proceed.
Requirements Specification
Functional Requirements
The first requirement will enable seamless integration in the platform, I would like to be able to do
bap /bin/ls --disassembler=superset
. It shall be packed as a separate configuration feature, so that we can do./configure --enable-superset-disassembler
. The second requirement will allow me to choose my preferences, e.g.,bap /bin/ls --disassembler=superset --superset-disassembler-features=loops-with-breaks
. Finally, the disassembler should expose a stable API which could be used to build ad-hoc and fine grained solutions.So, if you agree with those goals, let's always keep them in mind. Right now, as far as I understand the code, only option 3 is partially fulfilled.
Administrative Requirements
Next, are non-functional requirements, so to say administrative issues. As an artifact, this code doesn't belong to bap_disasm. Neither it should be internal to
bap.std
library nor it shall be a part ofBap.Std
interface. (The same is true for recursive descent disassembler, and we will remove it in BAP 2.0 in a separate library). Therefore it should be packed into two (optionally three) components. The reusable library, which exposes the programmatic interface to the disassembler. The library shall depend on theBap.Std
interface and, if necessary, others. A plugin, which exposes some of the library interface to the command line interface, making sane defaults. And a frontend, which will provide utility functions (we can pack them into the plugin, we will see later, whether we need it or not, the main concern would be dependencies). Therefore, we have the following tasks:lib/bap_disasm
.Concerning the requirement 2.1, it is not really necessary to keep it in the bap repository, if you want you can keep it in your own repository, move your repository to BinaryAnalysisPlatform organization, or pick a place in the bap repository, e.g.,
lib/bap_superset_disassembler
.And no matter what choice you will made, you have to give proper names to all your compilation units, aka files. OCaml has a flat namespace for compilation units, so if you have a file named
features.ml
you will not be able to link any other plugin or library that has the same file in its code base. Therefore, you need to prefix all your files, e.g., start all library files withbap_superset_disassembler_
and all your plugin files withsuper_disassembler_
prefixes.Coding Standards Requirements
Those are self-explanatory
todo
s in the released codeQuality Requirements
The number of the tests is to be decided. Though I would like to have close to 100% coverage of the core components.
Documentation Requirements
The overall description should include a brief overview of the algorithm, purposes and tradeoffs. It shall reference the paper. If there are any differences between the paper and the implementation they should be highlighted. This documentation will end up in the plugin man page. So a user shall be able to understand without further ado, why does he need this plugin, how to enable it, and how to configure.
The detailed documentation is needed for us to support and bug fix it. It could be spread around the github discussions, comments in the internal mli files, and ml files. It shall document the purposes and invariants (if any) of all modules, and some crucial functions.
Finally, all public (accessible via the public mli file) functions shall be thoroughly documented, so that a user can apply them without having to refer to the implementation.