
Talk on 09.10.2018

Egor Bogomolov edited this page Oct 9, 2018 · 2 revisions
  • code2vec

    • The authors' original implementation is quite hard to understand
    • It uses JavaParser for parsing
    • The src-d implementation is incomplete
    • The attention mechanism can be improved
      • Look into this during the modification stage
  • How to mine file pairs from the IDEA history

    • Go is cool
      • Runs faster than Python
      • A Babelfish client is available
    • Python
      • Git bindings for Python are available
      • Everything else is in Python
  • Prediction

    • Predict the author of a method
  • Parsing

    • LSP (Language Server Protocol, Microsoft)
      • No SDK for Python
      • More complex than Babelfish
    • Tree-sitter
      • Work in progress
      • No Java support
    • JavaParser
      • Used by the original code2vec implementation
      • Downsides are known
      • Hard to extend to other languages
    • PSI
      • Requires running IDEA to use it
  • Data

    • File before / after
    • Author
      • Clean data to avoid multiple aliases for single author
    • Date of commit
    • Metadata
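The author clean-up step above could be sketched as follows. This is a minimal illustration rather than the agreed approach: it assumes two identities belong to the same person when they share a normalized name or a normalized email, and the function name `merge_aliases` and the sample identities are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def merge_aliases(identities):
    """Map every (name, email) identity to one canonical identity.

    Assumption: identities sharing a normalized name OR a normalized
    email belong to the same author. The earliest identity in the input
    is used as the canonical one for its group.
    """
    parent = list(range(len(identities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so the earliest identity wins.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}
    for idx, (name, email) in enumerate(identities):
        for key in (("name", normalize(name)), ("email", normalize(email))):
            if not key[1]:
                continue  # ignore empty names or emails
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    return {identity: identities[find(i)] for i, identity in enumerate(identities)}
```

Applied to commit metadata, the returned mapping can replace each raw (name, email) pair with its canonical form before authorship labels are built. Matching on name alone can merge distinct people with the same name, so a real clean-up would likely need manual review.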
  • Pipeline

    • Extract pairs of files
    • Build a UAST for each file
    • Extract all methods
    • Match pairs of methods and detect changed ones
    • The pairs of changed methods form our dataset
    • Convert the dataset to a format compatible with code2vec
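The matching step of the pipeline above might look roughly like this. The sketch assumes that the earlier steps have already produced, for each version of a file, a mapping from method signature to method body. Matching by identical signature and comparing whitespace-normalized bodies are simplifying assumptions, and the name `changed_method_pairs` is hypothetical.

```python
def normalize_body(body):
    """Collapse whitespace so pure formatting changes are not reported."""
    return " ".join(body.split())


def changed_method_pairs(before, after):
    """Return (signature, old_body, new_body) for methods that changed.

    `before` and `after` map method signatures to method bodies for the
    two versions of one file. Methods present in only one version
    (added, removed, or renamed) are skipped in this sketch.
    """
    pairs = []
    for signature in sorted(before.keys() & after.keys()):
        old_body, new_body = before[signature], after[signature]
        if normalize_body(old_body) != normalize_body(new_body):
            pairs.append((signature, old_body, new_body))
    return pairs
```

Comparing UAST subtrees instead of text would be more robust to comment and formatting changes. The text comparison is used here only to keep the example self-contained.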
  • Motivation

    • Code review, knowledge transfer
    • Authorship of small snippets of code
  • Another level of abstraction

    • Batches of methods to get importance of single method
  • Tasks

    • Vladimir: mining in Python
    • Egor: code2vec, review of existing implementations