Talk on 09.10.2018
Egor Bogomolov edited this page Oct 9, 2018
-
Code2Vec
- Original implementation by authors is quite hard to understand
- It uses JavaParser
- src-d implementation isn't complete
- Attention can be improved
- Check it during modification stage
-
How to mine file pairs from the IntelliJ IDEA history
- Go is cool
- Works faster
- There is a babelfish client
- Python
- A Git library is available for Python
- Everything else is in Python
-
Prediction
- Author of methods
-
Parsing
- LSP (Language Server Protocol, Microsoft)
- No SDK for Python
- More complex than babelfish
- TreeSitter
- Work in progress
- No Java
- JavaParser
- Original implementation of code2vec
- Downsides are known
- Hard to extend to other languages
- PSI (IntelliJ Program Structure Interface)
- You have to run IDEA to use it
-
Data
- File before / after
- Author
- Clean data to avoid multiple aliases for single author
- Date of commit
- Metadata
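
The fields above could be held in a small record type, with a helper that collapses author aliases. This is a minimal sketch only: the names `ChangeRecord` and `canonical_author`, and the hand-maintained `aliases` mapping, are assumptions for illustration and do not come from the notes.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeRecord:
    """One mined sample: a file before and after a commit, plus metadata."""
    before: str        # file contents before the commit
    after: str         # file contents after the commit
    author: str        # canonical author identity
    date: datetime     # date of the commit
    metadata: dict = field(default_factory=dict)


def canonical_author(name: str, email: str, aliases: dict) -> str:
    """Map a (name, email) pair to a single canonical author id.

    `aliases` is a hypothetical hand-maintained mapping from lowercased
    e-mails or names to the canonical id.
    """
    email_key = email.strip().lower()
    name_key = " ".join(name.split()).lower()
    # Prefer an e-mail match, then a name match, then fall back to the e-mail.
    return aliases.get(email_key) or aliases.get(name_key) or email_key
```

Falling back to the normalized e-mail keeps unknown authors distinct from each other while still merging case and whitespace variants of the same address.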
-
Pipeline
- Extract pairs of files
- Build a UAST for each file
- Extract all methods
- Match pairs of methods and detect changed ones
- Pairs of methods are our dataset
- Process it to be compatible with code2vec
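
The matching step could look roughly like the sketch below. It assumes the methods of each file version have already been extracted from the UAST into a dict keyed by a method identifier (for example, name plus parameter types), and it matches methods by that identifier only, so renamed methods are not paired. The function name `changed_method_pairs` and this matching rule are assumptions, not something decided in the notes.

```python
def changed_method_pairs(before: dict, after: dict) -> list:
    """Return (identifier, old_body, new_body) for methods that exist in
    both versions and whose bodies differ beyond whitespace."""

    def normalize(body: str) -> str:
        # Collapse all whitespace so formatting-only edits are ignored.
        return " ".join(body.split())

    pairs = []
    # Only methods present on both sides can form a before/after pair.
    for ident in sorted(before.keys() & after.keys()):
        if normalize(before[ident]) != normalize(after[ident]):
            pairs.append((ident, before[ident], after[ident]))
    return pairs


before = {"foo()": "return 1;", "bar()": "return  2;"}
after = {"foo()": "return 3;", "bar()": "return 2;", "baz()": "return 0;"}
print(changed_method_pairs(before, after))
# [('foo()', 'return 1;', 'return 3;')]
```

Added or deleted methods (`baz()` here) are dropped, since they have no counterpart to pair with.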
-
Motivation
- Code review, knowledge transfer
- Authorship of small snippets of code
-
Another level of abstraction
- Batches of methods to get importance of single method
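
One possible reading of this idea, not confirmed by the notes: code2vec weights path-contexts with softmax attention, and the same mechanism could be applied one level up, over the vectors of the methods in a batch, so that each method gets an importance weight. The sketch below shows only the weighting step; `method_vectors` and `attention_vector` are hypothetical inputs (in practice the attention vector would be learned).

```python
import math


def attention_weights(method_vectors, attention_vector):
    """Softmax over dot products between each method vector and a
    shared attention vector; returns one weight per method."""
    if not method_vectors:
        raise ValueError("need at least one method vector")
    scores = [
        sum(x * a for x, a in zip(vec, attention_vector))
        for vec in method_vectors
    ]
    # Subtract the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The weights sum to 1, and the method whose vector aligns best with the attention vector receives the largest weight.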
-
Tasks
- Vladimir: mining in Python
- Egor: code2vec, review of implementations