Ask a librarian

This dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and intended for machine learning use. The corpus is in Finnish. The data matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) machine learning models.

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from the general public with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs, with the requirement that each document have at least 4 subjects.
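The selection rule above can be illustrated with a minimal Python sketch. The record layout (a dict with a `subjects` list) and the function name `filter_by_subject_count` are assumptions made for illustration only; the actual extraction code and file format are not described here.

```python
def filter_by_subject_count(pairs, minimum=4):
    """Keep only question/answer pairs with at least `minimum` subjects.

    `pairs` is assumed to be a list of dicts, each with a "subjects" list
    (hypothetical layout, not the corpus's real file format).
    """
    return [pair for pair in pairs if len(pair["subjects"]) >= minimum]


# Tiny made-up example records, not taken from the corpus.
example_pairs = [
    {"id": "a", "subjects": ["s1", "s2", "s3", "s4"]},
    {"id": "b", "subjects": ["s1", "s2"]},
    {"id": "c", "subjects": ["s1", "s2", "s3", "s4", "s5"]},
]
kept = filter_by_subject_count(example_pairs)
kept_ids = [pair["id"] for pair in kept]
```

With these made-up records, only the pairs with four or more subjects are kept.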

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: random sample subset (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validation (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
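The year-based split listed above can be sketched as follows. This is only an illustration under assumptions: each document is represented as a dict with a `year` field, and the names `assign_split`, `split_corpus` and `sample_maui_train` are hypothetical. The random seed is also an assumption, so the sample will not reproduce the published maui-train subset.

```python
import random


def assign_split(year):
    """Map the year a question was asked to a split name."""
    if year < 2016:
        return "train"
    if year == 2016:
        return "validate"
    if year == 2017:
        return "test"
    return None  # the listed counts imply no later questions in the corpus


def split_corpus(docs):
    """Group documents into all/train/validate/test by their "year" field."""
    splits = {"all": [], "train": [], "validate": [], "test": []}
    for doc in docs:
        splits["all"].append(doc)
        name = assign_split(doc["year"])
        if name is not None:
            splits[name].append(doc)
    return splits


def sample_maui_train(train_docs, n=200, seed=0):
    """Draw a random subset of the training documents (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(train_docs, min(n, len(train_docs)))


# Made-up documents for illustration only.
docs = [{"id": i, "year": y} for i, y in enumerate([2010, 2015, 2016, 2017, 2014])]
splits = split_corpus(docs)
maui_train = sample_maui_train(splits["train"], n=2)
```

Splitting by date rather than at random means the validation and test sets contain only questions asked after those used for training.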

The original corpus is available at https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja

The Ask a Librarian service is available at https://www.libraries.fi/ask. Libraries.fi is responsible for developing and maintaining the service.

Additional Info

Collection Open Data
Maintainer CSC – IT Center For Science Ltd.
Maintainer email analytics@csc.fi
Links to additional information https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja
Update frequency
Last modified 04.02.2022
Created on 21.12.2020