Dates
- Assignment posted on 19 May 2015.
- Scrum 2 June 2015.
- Presentation 9 June 2015.
Rationale
Students exercise the data extraction phase of MSR (mining software repositories) while combining a practical attitude with a research attitude. Data extraction precedes any sort of data synthesis (e.g., metrics) and data analysis (e.g., distributions). Data extraction is the most software engineering-oriented part of MSR, whereas data synthesis and analysis are much closer to the general discipline of information retrieval. Ideally, students will continue working on their projects in subsequent synthesis/analysis-related assignments.
Context
To provide some common context for the students, let's focus on "developer profiling", which is defined here to mean that we aim to extract information from software repositories that allows us to compare or rank developers on the grounds of metrics, qualities, topics, or the like.
Teams
Students can work alone or in teams of two. If you work in a team, you must make an extra effort to convince everyone that all team members have made similar contributions to the assignment.
Assignment
- Formulate a preliminary research question related to developer profiling.
- We will later look into data synthesis and analysis. The question may be revised then.
- Identify a concrete data source to which you have access.
- Please note that the data source needs to include "traces" of multiple developers.
- Do not select a very general source such as "(all of) GitHub".
- Use existing publications for inspiration; see below.
- In the interest of limiting your effort, pick a relatively "small" data source or filter.
- Identify data access technologies, e.g.:
- If you access GitHub source code, familiarize yourself with the relevant API.
- Implement raw data extraction.
- For instance, you may dump your raw data into XML, JSON, RDF, or a SQL/NoSQL database.
- Implement extra data extraction activities, where necessary.
- This could be filtering, transformation, abstraction, e.g.:
- If you plan to process text, familiarize yourself with stemming, e.g., on the grounds of NLTK.
- If you plan to analyze program identifiers, familiarize yourself with identifier splitting.
- Use existing publications for inspiration; see below.
- It is enough to dump the data produced by such extra activities; you need not also dump the raw data.
- Report on related data showcases at the MSR conference or elsewhere.
- What was the data source?
- What technologies were used?
- …
- Submit all source code and slides of your presentation to SVN.
- Submit your data dumps, if feasible (< 1MB), to SVN.
- If you want to use public or unikold Git, please submit the repo URL to SVN.
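As a rough illustration of the raw data extraction step, the following sketch assumes a GitHub repository as the data source and uses the GitHub REST endpoint for listing commits (`GET /repos/{owner}/{repo}/commits`). The helper names (`fetch_commits`, `flatten_commit`, `dump_records`) and the selection of fields are assumptions made for this example, not requirements of the assignment. Unauthenticated requests are rate-limited and only one page is fetched, so a real extractor would likely need a token and pagination.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_commits(owner, repo, per_page=30):
    """Fetch one page of commits (needs network access; no pagination)."""
    url = f"{API}/repos/{owner}/{repo}/commits?per_page={per_page}"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def flatten_commit(item):
    """Keep only the developer-related fields of one commit object."""
    author = item["commit"]["author"]  # the Git author, not the GitHub user
    return {
        "sha": item["sha"],
        "name": author["name"],
        "email": author["email"],
        "date": author["date"],
        "message": item["commit"]["message"],
    }


def dump_records(items, path):
    """Flatten all commits and write them to a JSON file as the raw dump."""
    records = [flatten_commit(i) for i in items]
    with open(path, "w", encoding="utf-8") as out:
        json.dump(records, out, indent=2, ensure_ascii=False)
    return records
```

A call such as `dump_records(fetch_commits("some-owner", "some-repo"), "commits.json")` would write one JSON record per commit; the owner and repository names here are placeholders.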
Sample
- Short paper on API-based developer profiling: http://softlang.uni-koblenz.de/apidevprof/
- MSc thesis on API-based developer profiling: http://softlang.wikidot.com/event:150413-msc1
Contact
If there are any questions, please contact Ralf Lämmel <laemmel@uni-koblenz.de>.
If you want to get your plan approved, please also contact Ralf Lämmel.
Scrum
Be prepared (no slides!) to briefly summarize your choices regarding the assignment parameters. Also identify any open problems so that teaching staff or fellow students can help. Please commit a short README (as a summary of standup comedy) to SVN.
Presentation
- Prepare a 10-13 minute talk with 15 slides or fewer.
- Address the parameters from the assignment explicitly.
- Demo your data extractor.
- Try to give a good talk, as you were advised before.
Data sources
Also have a look at the "Data showcase" papers of MSR 2014: http://dblp.uni-trier.de/db/conf/msr/msr2014.html.
- SCS (source control system)
- Source code
- [14-1-1] http://dl.acm.org/citation.cfm?doid=2597073.2597085
- [14-2-2] http://dl.acm.org/citation.cfm?doid=2597073.2597111
- [14-2-3] http://dl.acm.org/citation.cfm?doid=2597073.2597094
- [14-4-2] http://dl.acm.org/citation.cfm?doid=2597073.2597077
- [14-5-4] http://dl.acm.org/citation.cfm?doid=2597073.2597096
- [14-7-2] http://dl.acm.org/citation.cfm?doid=2597073.2597082
- [14-7-3] http://dl.acm.org/citation.cfm?doid=2597073.2597087
- [14-8-3] http://dl.acm.org/citation.cfm?doid=2597073.2597109
- [14-9-1] http://dl.acm.org/citation.cfm?doid=2597073.2597102
- Revisions
- Change logs/commits
- [14-2-2] http://dl.acm.org/citation.cfm?doid=2597073.2597111
- [14-3-3] http://dl.acm.org/citation.cfm?doid=2597073.2597108
- [14-4-1] http://dl.acm.org/citation.cfm?doid=2597073.2597074
- [14-4-4] http://dl.acm.org/citation.cfm?doid=2597073.2597081
- [14-5-4] http://dl.acm.org/citation.cfm?doid=2597073.2597096
- [14-6-1] http://dl.acm.org/citation.cfm?doid=2597073.2597075
- [14-6-2] http://dl.acm.org/citation.cfm?doid=2597073.2597078
- [14-7-1] http://dl.acm.org/citation.cfm?doid=2597073.2597076
- [14-8-1] http://dl.acm.org/citation.cfm?doid=2597073.2597107
- [14-10-4] http://dl.acm.org/citation.cfm?doid=2597073.2597093
- [14-10-8] http://dl.acm.org/citation.cfm?doid=2597073.2597095
- Pull requests
- Execution data
- Test results
- Execution traces
- Monitoring
- [14-1-1] http://dl.acm.org/citation.cfm?doid=2597073.2597085
- [14-1-2] http://dl.acm.org/citation.cfm?doid=2597073.2597097
- [14-5-4] http://dl.acm.org/citation.cfm?doid=2597073.2597096
- [14-8-2] http://dl.acm.org/citation.cfm?doid=2597073.2597092
- [14-9-4] http://dl.acm.org/citation.cfm?doid=2597073.2597106
- [14-10-1] http://dl.acm.org/citation.cfm?doid=2597073.2597103
- Communication
- Mailing lists
- News groups
- Social media
- Comments
- Forum threads
- Peer code reviews
- DTS (defect tracking system)
- Bug reports
- [14-2-3] http://dl.acm.org/citation.cfm?doid=2597073.2597094
- [14-3-1] http://dl.acm.org/citation.cfm?doid=2597073.2597098
- [14-3-2] http://dl.acm.org/citation.cfm?doid=2597073.2597099
- [14-3-3] http://dl.acm.org/citation.cfm?doid=2597073.2597108
- [14-4-4] http://dl.acm.org/citation.cfm?doid=2597073.2597081
- [14-5-4] http://dl.acm.org/citation.cfm?doid=2597073.2597096
- [14-6-2] http://dl.acm.org/citation.cfm?doid=2597073.2597078
- [14-9-2] http://dl.acm.org/citation.cfm?doid=2597073.2597086
- [14-10-2] http://dl.acm.org/citation.cfm?doid=2597073.2597105
- [14-10-3] http://dl.acm.org/citation.cfm?doid=2597073.2597112
- [14-10-5] http://dl.acm.org/citation.cfm?doid=2597073.2597088
- [14-10-6] http://dl.acm.org/citation.cfm?doid=2597073.2597089
- [14-10-8] http://dl.acm.org/citation.cfm?doid=2597073.2597095
- [14-10-9] http://dl.acm.org/citation.cfm?doid=2597073.2597090
- [14-10-10] http://dl.acm.org/citation.cfm?doid=2597073.2597091
- Publications
- Tutorials
- Surveys
Data extraction activities
- SCS access and access to other data sources listed above
- Filtering
- Source code/revisions
- by language
- by size
- by date
- Communication
- [14-1-3] http://dl.acm.org/citation.cfm?doid=2597073.2597110
- [14-4-3] http://dl.acm.org/citation.cfm?doid=2597073.2597083
- by time
- by relevance
- by rating
- Bug reports
- by status
- by time
- by empty fields
- Other
- by stop words
- Transformation
- Source code
- Camel case splitting
- Conversion to srcML
- Communication
- Stemming
- Bug reports
- Event log generation
- State minimization and cleaning
- Abstraction
- Control charts
- Text detection (keyword, fuzzy line, fuzzy patch)
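Two of the activities listed above, camel case splitting and filtering by stop words, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the function names and the tiny stop word set are illustrative, the splitting rules cover only common camel case and underscore conventions, and stemming (e.g., with NLTK) is left out to keep the example dependency-free.

```python
import re

# Boundaries inside a chunk: lower case/digit followed by upper case
# ("parseXml" -> "parse Xml"), and an acronym followed by a capitalized
# word ("XMLFile" -> "XML File").
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Illustrative stop words only; a real list depends on the data source.
STOP_WORDS = {"get", "set", "the", "a", "of", "to"}


def split_identifier(identifier):
    """Split a program identifier into lower-case word tokens."""
    tokens = []
    # First split on underscores and other non-alphanumeric characters.
    for chunk in re.split(r"[_\W]+", identifier):
        if chunk:
            # Then mark camel case boundaries with spaces and split on them.
            tokens.extend(_BOUNDARY.sub(" ", chunk).split())
    return [t.lower() for t in tokens]


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]


tokens = filter_stop_words(split_identifier("getHTTPResponseCode"))
# tokens == ["http", "response", "code"]
```

The resulting tokens could then be counted per developer, which is the kind of input that later synthesis and analysis steps would build on.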
Acknowledgement
Thomas Bernau helped with collecting and tagging MSR papers with regard to categories of data sources and extra data extraction activities.