The HerLD Corpus

A primary goal of the Heritage Language Variation and Change Project (HLVC) is to construct a unique corpus of conversational speech in ten Heritage Languages spoken in the Greater Toronto Area. This corpus, the Heritage Language Documentation Corpus, or HerLD, contains recordings in the Heritage Languages of speakers representing three generations, and comparable homeland samples. Our goal is to record 40 heritage speakers, balanced for age and sex, for each of the three generations (and 20 speakers for languages where only two generations exist in Toronto, i.e., Korean and Faetar).

The HerLD corpus data is housed in Dataverse, an online repository.

Corpus Use Form: Anyone wishing to use data from the HerLD corpus must fill in this form. You will also need to sign up for a free account on Dataverse and provide your username when filling out the corpus use form.

HLVC Speaker sample

Distribution of 451 transcribed interviews

(updated March 2023)

Language	HOM	Gen1	Gen2	Gen3	Total by Language
Cantonese	14	18	25	6	63
Faetar	34	11	3	0	48
Hungarian	0	2	6	3	11
Italian	9	10	21	11	51
Korean	10 *	19	24	1	54
Polish	12	12	14	0	38
Portuguese	12	0	3	0	15
Russian	19 *	18	17	3	57
Tagalog	17	20	18	0	55
Ukrainian	17	10	17	15	59
Total by Generation	125	120	148	39	451

* We have not collected Homeland Russian data. Instead, we have adapted relevant portions of the Russian National Corpus (Institute of Russian Language, Russian Academy of Sciences, 2003-2023), whose owners we gratefully thank.

* Dr. Hikyoung Lee and her research team at KU collected the Korean Homeland data.

The table below shows the number of speakers per language currently in the HerLD corpus's Dataverse repository. Each total is given as speakers on Dataverse / goal and covers both sexes combined. Red totals indicate languages with fewer speakers on Dataverse than our goal.

Language	Sex	HOM	Gen1	Gen2	Gen3	Total
Cantonese	M	8	9	14	3	66/40
Cantonese	F	10	9	10	3	66/40
Faetar	M	0	0	0	0	0/20
Faetar	F	0	0	0	0	0/20
Hungarian	M	0	1	1	1	10/40
Hungarian	F	0	1	5	1	10/40
Italian	M	14	7	10	5	75/40
Italian	F	16	5	12	6	75/40
Korean	M	0	8	11	0	40/20
Korean	F	0	8	12	1	40/20
Polish	M	4	3	6	0	32/40
Polish	F	7	5	7	0	32/40
Portuguese	M	5	1	1	0	22/40
Portuguese	F	8	0	7	0	22/40
Russian	M	0	7	9	1	38/40
Russian	F	0	11	8	2	38/40
Tagalog	M	7	9	6	0	59/40
Tagalog	F	9	14	14	0	59/40
Ukrainian	M	8	5	8	7	59/40
Ukrainian	F	9	5	9	8	59/40
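Because the red highlighting may not survive in plain-text copies of this page, the languages that are below their goal can be recomputed from the table. The following is a minimal Python sketch. The counts and goals are copied from the table above; the names COUNTS, GOALS, total and below_goal are illustrative only and are not part of any HLVC tool.

```python
# Speakers on Dataverse per language: (male counts, female counts),
# each ordered HOM, Gen1, Gen2, Gen3, copied from the table above.
COUNTS = {
    "Cantonese": ((8, 9, 14, 3), (10, 9, 10, 3)),
    "Faetar": ((0, 0, 0, 0), (0, 0, 0, 0)),
    "Hungarian": ((0, 1, 1, 1), (0, 1, 5, 1)),
    "Italian": ((14, 7, 10, 5), (16, 5, 12, 6)),
    "Korean": ((0, 8, 11, 0), (0, 8, 12, 1)),
    "Polish": ((4, 3, 6, 0), (7, 5, 7, 0)),
    "Portuguese": ((5, 1, 1, 0), (8, 0, 7, 0)),
    "Russian": ((0, 7, 9, 1), (0, 11, 8, 2)),
    "Tagalog": ((7, 9, 6, 0), (9, 14, 14, 0)),
    "Ukrainian": ((8, 5, 8, 7), (9, 5, 9, 8)),
}

# Goals as shown after the slash in the Total column of the table.
GOALS = {language: 40 for language in COUNTS}
GOALS["Faetar"] = 20
GOALS["Korean"] = 20


def total(language):
    """Sum male and female counts across HOM and all generations."""
    male, female = COUNTS[language]
    return sum(male) + sum(female)


def below_goal():
    """Map each language that is under its goal to (total, goal)."""
    return {
        language: (total(language), GOALS[language])
        for language in COUNTS
        if total(language) < GOALS[language]
    }


if __name__ == "__main__":
    for language, (have, goal) in below_goal().items():
        print(f"{language}: {have}/{goal}")
```
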

For each speaker recorded, there are several types of data.

(See file-naming conventions.)

1. Primary content: a set of audio-recorded interviews with time-aligned transcriptions (IV)

Interviews average one hour in length. Normally the speakers include one Heritage Language participant and one or two Heritage Language-speaking interviewers. Sometimes members of the participant's family are present. All segments of the interview conducted in the Heritage Language are transcribed as fully as possible. Lengthy switches to English are not always transcribed. An English version of the interview modules is available here. It is also available in each Heritage Language.

The audio interviews are in .wav format.

For access to the uncompressed .wav files, please fill out the Corpus Use Form mentioned above to request access to the HerLD corpus.
Access to the compressed .mp3 files is through the Corpora in the Classroom website, available with permission to people with UTORids.

Time-aligned transcriptions have been produced (or are being produced) for each interview.

Transcriptions are constructed using ELAN and are available as ELAN Annotation files (.eaf files). These are easily converted to many other formats. All are Unicode-compliant.

Transcriptions are aligned at the phrase level.

Switches to English (and other languages) are marked.
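Since .eaf files are XML, the phrase-level annotations and their time stamps can be read with a standard XML parser. The following is a minimal Python sketch, assuming the usual ELAN layout (TIME_SLOT elements holding millisecond values, and TIER elements containing ALIGNABLE_ANNOTATION elements). The sample fragment is made up for illustration and is not HerLD data; the tier name "Speaker" and the function name read_aligned_annotations are assumptions, since the tier names used in HerLD files are not described here.

```python
import xml.etree.ElementTree as ET

# A tiny made-up EAF fragment for illustration only (not corpus data).
# The tier name "Speaker" is an assumption.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="Speaker">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(root):
    """Return (tier_id, start_ms, end_ms, text) tuples from an EAF root."""
    # Map each time-slot ID to its value in milliseconds.
    # TIME_VALUE may be absent for unaligned slots, so allow None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((
                tier.get("TIER_ID"),
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return rows


if __name__ == "__main__":
    for row in read_aligned_annotations(ET.fromstring(SAMPLE_EAF)):
        print(row)
```

For a file on disk, the root can be obtained with ET.parse(path).getroot() and passed to the same function.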

Cantonese is transcribed phonetically, using the Jyutping system. We are adding colloquial Cantonese character transcription.

Jyutping is as in Matthews, Stephen and Virginia Yip. 1994. Cantonese: a comprehensive grammar. London, New York: Routledge. 13-22.

Faetar is transcribed in IPA (International Phonetic Alphabet) and loosely translated into English.

Korean is transcribed in Hangul.

Hungarian is transcribed orthographically.

Italian is transcribed orthographically.

Polish is transcribed orthographically.

Portuguese is transcribed orthographically.

Russian and Ukrainian are transcribed using Comrie & Corbett’s Transliteration.

Comrie, Bernard & Greville Corbett. 2002. The Slavonic Languages. London & New York: Routledge. 827, 832-833.

Exceptions to this system: symbols with a hachek are replaced by zh, sh, ch, and shh; the 'hard sign' is written as # and 'open e' is written as je.

These transcriptions can be automatically transliterated to Cyrillic at http://www.translit.ru.

Tagalog is transcribed orthographically.

2. Participants respond orally to an Ethnic Orientation Questionnaire (EOQ)

Audio recordings of these are available (as _EOQ.wav files). [See file-naming conventions]
Some participants' responses are transcribed (in _EOQ.eaf files). Many are not fully transcribed. Rather, the question number is indicated in the transcript if the interviewer asked the question verbatim.
Responses to each question are numerically coded in a spreadsheet for all participants of one Heritage Language: EOQ_data_LANGUAGE_date.xls.

3. Participants name a set of pictures, the First Words task (FW)

Participants are asked to describe a sequence of pictures from a children's story book, naming common objects and then describing scenes containing these items.
The book used is:

Amery, Heather and Stephen Cartwright. 1987. The First Hundred Words. Tulsa: Educational Development Corporation.

Audio (.wav) and transcription (.eaf) files are available for this task.

File naming conventions

File labels have three parts:

Speaker code of the primary participant

Abbreviation indicating the type of interaction (preceded by an underscore):

_IV is the sociolinguistic interview, generally a lengthy, relaxed conversation

_EOQ is the oral administration of the Ethnic Orientation Questionnaire

_FW is a picture-description task, referred to as First Words

An extension indicating the file type (preceded by a dot or period):

.eaf is an ELAN annotation file, or transcript

.wav is an uncompressed audio recording

.mp3 is a compressed audio recording

Additional file types include Praat TextGrids (.TextGrid) for acoustic analysis, and .xls and .xlsx files for record-keeping of various types, e.g., the catalog.

Example: F1F29A_IV.eaf is the transcription of the sociolinguistic interview of speaker F1F29A.
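The three-part label can be split mechanically. The following is a minimal Python sketch of the convention described above; the names TASKS, FILE_LABEL and split_file_label are illustrative, and the pattern assumes that the speaker code itself contains no underscore or dot.

```python
import re

# Interaction types listed in the conventions above.
TASKS = {
    "IV": "sociolinguistic interview",
    "EOQ": "Ethnic Orientation Questionnaire",
    "FW": "First Words picture-description task",
}

# speaker code, underscore, task abbreviation, dot, extension
FILE_LABEL = re.compile(
    r"^(?P<speaker>[^_.]+)_(?P<task>[A-Za-z]+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def split_file_label(name):
    """Split a label such as 'F1F29A_IV.eaf' into its three parts."""
    match = FILE_LABEL.match(name)
    if match is None:
        raise ValueError(f"not a three-part file label: {name!r}")
    task = match["task"]
    if task not in TASKS:
        raise ValueError(f"unknown interaction type: {task!r}")
    return {
        "speaker": match["speaker"],
        "task": task,
        "task_description": TASKS[task],
        "extension": match["ext"],
    }


if __name__ == "__main__":
    print(split_file_label("F1F29A_IV.eaf"))
```

The extension is returned as found rather than validated, because additional file types (for example .TextGrid) also occur.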

Speaker labelling conventions

Each speaker is identified by a speakercode. The speakercode consists of five parts:

The first character identifies the heritage language of the speaker (C, F, H, I, K, P, L, R, T, U).

The second character identifies the heritage speaker's generation (1, 2 or 3). X or 0 identifies a Homeland speaker.

The third character identifies the speaker's sex (M or F).

The fourth and fifth characters give the speaker's age.

The final character (A, B, C, etc.) provides a unique identifier for otherwise identically labeled speakers.

Example: F1F29A is a Faetar-speaking, first-generation, 29-year-old female. She is the first such speaker recorded.
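The speakercode layout above can be checked and decoded with a short pattern. The following is a minimal Python sketch, assuming a two-digit age and a single uppercase final letter, as the description implies. The names SPEAKER_CODE, Speaker and parse_speaker_code are illustrative. Language letters are returned as-is rather than expanded to language names, because the list above does not spell out which letter stands for which language.

```python
import re
from typing import NamedTuple

# language letter, generation, sex, two-digit age, disambiguating letter
SPEAKER_CODE = re.compile(
    r"^(?P<language>[CFHIKPLRTU])"
    r"(?P<generation>[123X0])"
    r"(?P<sex>[MF])"
    r"(?P<age>\d{2})"
    r"(?P<suffix>[A-Z])$"
)


class Speaker(NamedTuple):
    language: str    # one of C, F, H, I, K, P, L, R, T, U
    generation: str  # "1", "2", "3", or "homeland" (coded X or 0)
    sex: str         # "M" or "F"
    age: int
    suffix: str      # A, B, C, ... for otherwise identical codes


def parse_speaker_code(code):
    """Decode a speakercode such as 'F1F29A' into its five parts."""
    match = SPEAKER_CODE.match(code)
    if match is None:
        raise ValueError(f"not a valid speakercode: {code!r}")
    generation = match["generation"]
    if generation in ("X", "0"):
        generation = "homeland"
    return Speaker(
        language=match["language"],
        generation=generation,
        sex=match["sex"],
        age=int(match["age"]),
        suffix=match["suffix"],
    )


if __name__ == "__main__":
    print(parse_speaker_code("F1F29A"))
```
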

Website developed by Konstantin Shapoval, August 2013