Ask a librarian

This dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and is intended for machine learning use. The corpus is in Finnish. The matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) models.
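
As a rough illustration of what a TF-IDF matrix holds, the sketch below builds one from a few toy strings. It is not the pipeline used to produce the published matrices: whitespace tokenization stands in for whatever Finnish preprocessing was applied, the weighting (relative term frequency times log(N/df)) is only one of several common TF-IDF variants, and the function name tfidf_matrix is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Assumption: tf = term count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = tfidf_matrix(["kirja kirjasto", "kirja laina"])
    for term, weight in zip(vocab, rows[0]):
        print(term, round(weight, 3))
```

A term that occurs in every document gets weight zero under this variant, which is why common words contribute little to the matrix.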

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from a member of the general public together with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs; only documents with at least 4 subjects were included.
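
The selection rule above can be sketched as a simple filter. The record layout (a dict with a "subjects" list) and the function name select_corpus are assumptions made for illustration; the actual storage format of the question/answer pairs is not described here.

```python
def select_corpus(pairs, min_subjects=4):
    """Keep only question/answer pairs that carry at least `min_subjects` subjects.

    Assumption: each pair is a dict whose "subjects" key holds a list of subject labels.
    """
    return [pair for pair in pairs if len(pair["subjects"]) >= min_subjects]


if __name__ == "__main__":
    # Toy records, not taken from the real corpus.
    pairs = [
        {"id": 1, "subjects": ["a", "b", "c", "d"]},
        {"id": 2, "subjects": ["a"]},
    ]
    print([pair["id"] for pair in select_corpus(pairs)])
```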

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: a random sample (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validation (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
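
The chronological split listed above can be sketched as follows. The record layout (a dict with a "year" key), the function names, and the fixed random seed are assumptions for illustration only; the sampling procedure actually used for maui-train is not documented here.

```python
import random


def split_by_year(docs):
    """Assign documents to train (before 2016), validate (2016) and test (2017).

    Assumption: each document is a dict with an integer "year" key.
    """
    splits = {"all": list(docs), "train": [], "validate": [], "test": []}
    for doc in docs:
        if doc["year"] < 2016:
            splits["train"].append(doc)
        elif doc["year"] == 2016:
            splits["validate"].append(doc)
        elif doc["year"] == 2017:
            splits["test"].append(doc)
    return splits


def sample_maui_train(train, size=200, seed=0):
    """Draw a random subset of the training documents (hypothetical seed)."""
    rng = random.Random(seed)
    return rng.sample(train, min(size, len(train)))


if __name__ == "__main__":
    # Toy records, not taken from the real corpus.
    docs = [{"id": i, "year": year} for i, year in enumerate([2014, 2015, 2016, 2017])]
    splits = split_by_year(docs)
    print({name: len(items) for name, items in splits.items()})
```

Splitting by year rather than at random keeps the evaluation data strictly later than the training data, so a model is tested on questions it could not have seen when it was trained.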

The original corpus is available from https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja

The Ask a Librarian service is available at https://www.libraries.fi/ask. Libraries.fi is responsible for developing and maintaining the service.

Data resources

Data matrix (test) for training XMTC machine... (TXT): Data matrix for training XMTC machine learning models (TF-IDF) with Voikko...
Data matrix for training XMTC machine learning... (TXT): Data matrix for training XMTC machine learning models (TF-IDF) with TNPP...
Data matrix (test) for training XMTC machine... (TXT): Data matrix for training XMTC machine learning models (TF-IDF) with TNPP...
Data matrix for training XMTC machine learning... (TXT): Data matrix for training XMTC machine learning models (TF-IDF) with Voikko...

Additional Info

Collection	Open Data
Maintainer	CSC – IT Center For Science Ltd.
Maintainer email	analytics@csc.fi
Links to additional information	https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja
Update frequency	irregular
Last modified	04.02.2022
Created on	21.12.2020

License

To the extent possible under law, CSC – IT Center For Science Ltd. has waived all copyright and related or neighboring rights to Kysy kirjastonhoitajalta.
