Keven’s study consists of three phases: "In the first phase of my research, I collected the GitHub profiles of our researchers. As there is no central database with GitHub profiles of researchers from Utrecht University (UU), we had to collect the users from various sources. I searched GitHub for Utrecht University and collected the information. We also searched for GitHub profiles in the data of PURE. Then we searched ’paperswithcode.com’. When you limit that to Utrecht University, you can find the papers from scientists in Utrecht. The last source we used was the employees page
Analysing with SWORDS
Keven Quach’s master’s thesis is called ’Mapping Research Software Landscapes through Exploratory Studies of GitHub Data’. He performed his research as part of the Open Science Programme. He was already working with Professor Anna-Lena Lamprecht and Jonathan de Bruin as a research assistant to develop SWORDS prior to his thesis.
SWORDS stands for ’Scan and revieW of Open Research Data and Software’. SWORDS is a powerful tool to gain insight into the open source activities of a university or research institute. The thesis provided the SWORDS framework with additional variables. Although the analysis and data collection were done for UU researchers only, the purpose of this research is to serve as a template for other researchers to scan and review repositories for their university or organisation as well.
Donkey work
As a second step, Keven collected all code and software repositories. Keven merged all the information he had about the researchers and their GitHub profiles and then started his real donkey work. He went through all the repositories manually to check if the software that was published was either research software or software made for someone’s hobby. By checking all the software, he made sure he provided his research with an overview of research code and software. "We found 1500 repositories in total. I manually labelled all these repositories. Doing so is extremely tedious, I can tell you now from experience," says Keven laughing.
34% of the research code and software doesn’t have any license information. If you want someone else to be able to work with your research code and software, you need a license that permits reuse.
Graduate Keven Quach, master business informatics
Who is Keven Quach?
Keven Quach (1996) was born and raised in Germany. His parents came as refugees from Vietnam to our eastern neighbours in the seventies. After high school he attended the University of Bamberg in Bavaria, where he did a bachelor’s degree programme in business informatics. He wanted to do his master’s abroad and chose business informatics at Utrecht University, the programme he found the most interesting. Since his graduation in November last year, Keven has been working as a software engineer at Bosch in Friedrichshafen, back in Germany.
One hundred repositories per day
Keven thought he could do one hundred repositories per day, but he was being too optimistic. He needed about five to six weeks to go through all the repositories by hand. "Research software is often work in progress", Keven continues. "We needed some way to find out if the repository that we have in our dataset is really a research code or software repository. Identifying whether something is research output or not was a challenge. And it’s not that simple to do this automatically. Sometimes I needed to contact the researchers to ask them if it was research software or not." By labelling these repositories, Keven also looked into the extra data. That way he really got to know the dataset quite well.
In the third phase of his research, Keven looked at different variables, such as:
- Does it have a license?
- Is version control used correctly?
- Is citation information available?
That way he could analyse how FAIR the software and code are. These are some of the results:
"In the analyses of the research software we added a FAIRness score. We then added the score of each repository as one value and then we averaged that by use for each faculty. We also did this by distinguishing different types of research software."
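The scoring Keven describes can be sketched roughly as follows. This is a minimal illustration and not the implementation used in the thesis: the three yes/no variables are the ones named in the article, while the field names, the repository and faculty records, and the equal weighting of the variables are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the three variables named in the article.
VARIABLES = ("has_license", "uses_version_control", "has_citation")

# Invented example records; they are not data from the thesis.
REPOSITORIES = [
    {"name": "repo-a", "faculty": "Science",
     "has_license": True, "uses_version_control": True, "has_citation": False},
    {"name": "repo-b", "faculty": "Science",
     "has_license": False, "uses_version_control": True, "has_citation": False},
    {"name": "repo-c", "faculty": "Social Sciences",
     "has_license": True, "uses_version_control": True, "has_citation": True},
]


def fairness_score(repo):
    """Add up the yes/no variables of one repository into a single value.

    Equal weighting is an assumption; the article does not state weights.
    """
    return sum(1 for variable in VARIABLES if repo[variable])


def average_score_per_faculty(repos):
    """Group the repository scores by faculty and average each group."""
    scores = defaultdict(list)
    for repo in repos:
        scores[repo["faculty"]].append(fairness_score(repo))
    return {faculty: mean(values) for faculty, values in scores.items()}


if __name__ == "__main__":
    for faculty, average in sorted(average_score_per_faculty(REPOSITORIES).items()):
        print(f"{faculty}: {average:.2f}")
```

Because the result is an average per faculty, it only supports comparisons between faculties; it says nothing about how FAIR a faculty is in absolute terms.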
Remarkable findings
After collecting all this information, Keven started analysing the data. He looked at all kinds of aspects of the publications, such as quality, FAIRness, and popularity of the research software. Keven showed, for example, to which faculty each publication belonged. "It was remarkable that I did not find any repositories from the Faculty of Law, Economics and Governance. And the Faculty of Veterinary Medicine had fewer than ten published repositories." So those two faculties were excluded from further analysis.
There are two likely reasons for this. "The first one is that our search is very biased towards the other faculties in the way we collect users. For example, if we go back to the previous search strategies, ’PapersWithCode.com’ is quite heavily biased towards machine learning. Therefore, researchers from the Science Faculty will use this kind of website, and most likely no one publishing from Veterinary Medicine will. And, of course, not 100% of the code and software is on GitHub. So there will probably be an unknown, unreported amount of research software that exists, but that we do not know of due to the way we have captured this. The other explanation is that some faculties simply do not use that much research software."
The Faculty of Medicine was also not included in Keven’s research, since these researchers work at UMCU and cannot be found on the employee pages of Utrecht University. The faculty with the two largest GitHub accounts in terms of repositories is Humanities, namely the Digital Humanities Lab and the Institute for Language Sciences Labs. "In the first Lab they already have 80 repositories and in the latter even more, 140."
License for reuse
Keven found that 66% of the research code and software had an open license. "That means that 34% doesn’t have an open license. If you don’t use a license and you publish something, no one can use the code legally. It’s protected by default. So, unless you give it an open license, no one can use your work. You need a license to permit reuse. One of my recommendations is to inform researchers to add a license to their research software. It’s something that’s relatively simple to do."
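A share like the one above could be computed from repository metadata along these lines. This is a hedged sketch, not the method used in the thesis: it assumes each record is a dictionary shaped like a repository object from the GitHub REST API, whose "license" field is null when no license was detected. The sample records are invented, and the check only detects that some license is present, not whether that license is an open one.

```python
def has_license(repo):
    """Return True if the repository metadata names a license.

    Assumes `repo` resembles a GitHub REST API repository object, where
    the "license" field is None when no license was detected.
    """
    return repo.get("license") is not None


def license_share(repos):
    """Percentage of repositories with a detected license (0.0 if empty)."""
    if not repos:
        return 0.0
    return 100 * sum(has_license(repo) for repo in repos) / len(repos)


# Invented example records; they are not data from the thesis.
SAMPLE = [
    {"full_name": "example/with-mit", "license": {"spdx_id": "MIT"}},
    {"full_name": "example/with-gpl", "license": {"spdx_id": "GPL-3.0"}},
    {"full_name": "example/unlicensed", "license": None},
]

if __name__ == "__main__":
    print(f"{license_share(SAMPLE):.0f}% of the sample repositories have a license")
```

Whether a detected license actually permits reuse would need a further check, for instance against a list of open licenses, which this sketch leaves out.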
Programming languages
Keven also made an overview of the programming languages that are used by Utrecht researchers. He wanted to know if the languages used were free and open or commercial and closed (e.g., Python versus MATLAB). Python and R are the most commonly used programming languages. These are open source languages, which can be widely reused. Python is most frequently used within the Science Faculty; the Faculty of Social Sciences is the largest user of R.
How FAIR is the software developed by Utrecht researchers?
"Researchers at Utrecht University work relatively FAIR when it comes to research software", Keven says. "We see that the Support Departments (e.g. University Library and ITS) and the Faculty of Social Sciences perform the best. So that’s why I said ’relatively’, because we can only see that in relation to the other faculties we examined. To publish more FAIR, colleagues from the Support Departments and the Faculty of Social Sciences can show others what they do right or how they can work more FAIR. The ultimate goal of FAIR is to facilitate the reuse of data, code and software, and that’s still an ongoing process at the university."
What we can do with these findings
According to Jonathan de Bruin, based on the results of Keven’s master’s thesis, the following actions are to be taken:
- We can provide proactive support in faculties where little output can be found.
- We can tackle structural problems with quality in an integrated way.
- We can create awareness within the organisation.
- We can provide researchers with more or better information than we do now.
We aim to repeat the study after a year and thus monitor the impact.