A primary goal of the Heritage Language Variation and Change Project (HLVC) is to construct a unique corpus of conversational speech in ten Heritage Languages spoken in the Greater Toronto Area. This corpus, the Heritage Language Documentation Corpus, or HerLD, contains recordings in the Heritage Languages of speakers representing three generations, and comparable homeland samples. Our goal is to record 40 heritage speakers, balanced for age and sex, for each of the three generations (and 20 speakers for languages where only two generations exist in Toronto, i.e., Korean and Faetar).
Test version of HLVC Corpus in Dataverse
Distribution of 451 transcribed interviews (updated March 2023)

| Language | Homeland | Gen 1 | Gen 2 | Gen 3 | Total by Language |
|---|---|---|---|---|---|
| Cantonese | 14 | 18 | 25 | 6 | 63 |
| Faetar | 34 | 11 | 3 | 0 | 48 |
| Hungarian | 0 | 2 | 6 | 3 | 11 |
| Italian | 9 | 10 | 21 | 11 | 51 |
| Korean | 10 * | 19 | 24 | 1 | 54 |
| Polish | 12 | 12 | 14 | 0 | 38 |
| Portuguese | 12 | 0 | 3 | 0 | 15 |
| Russian | 19 * | 18 | 17 | 3 | 57 |
| Tagalog | 17 | 20 | 18 | 0 | 55 |
| Ukrainian | 17 | 10 | 17 | 15 | 59 |
| Total by Generation | 125 | 120 | 148 | 39 | 451 |
We have not collected Homeland Russian data. Instead, we have adapted relevant portions of the Russian National Corpus (Institute of Russian Language, Russian Academy of Sciences, 2003-2023), whose owners we gratefully thank.
Dr. Hikyoung Lee and her research team at KU collected the Korean Homeland data.
(See file-naming conventions.)
The audio interviews are in .wav and .mp3 format.
Time-aligned transcriptions have been produced (or are being produced) for each interview.
- Transcriptions are constructed using ELAN and are available as ELAN Annotation files (.eaf files). These are easily converted to many other formats. All are Unicode-compliant.
- Transcriptions are aligned at the phrase level.
- Switches to English (and other languages) are marked.
- Cantonese is transcribed phonetically, using the Jyutping system. We are adding colloquial Cantonese character transcription.
  - Jyutping is as in Matthews, Stephen and Virginia Yip. 1994. Cantonese: a comprehensive grammar. London, New York: Routledge. 13-22.
- Faetar is transcribed in IPA (International Phonetic Alphabet) and loosely translated into English.
- Korean is transcribed in Hangul.
- Hungarian is transcribed orthographically.
- Italian is transcribed orthographically.
- Polish is transcribed orthographically.
- Portuguese is transcribed orthographically.
- Russian and Ukrainian are transcribed using Comrie & Corbett’s transliteration.
  - Comrie, Bernard & Greville Corbett. 2002. The Slavonic Languages. London & New York: Routledge. 827, 832-833.
  - Exceptions to this system: symbols with hachek (use zh, sh, ch, and shh instead), the 'hard sign' (#) and 'open e' (je).
  - These transcriptions can be automatically transliterated to Cyrillic at http://www.translit.ru.
- Tagalog is transcribed orthographically.
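Because the transcriptions are ELAN .eaf files, which are XML, the phrase-level alignments can be read with standard tools. The following is a minimal sketch, not the project's own tooling: it assumes only the general EAF layout (TIME_SLOT elements holding millisecond offsets, and TIER elements containing ALIGNABLE_ANNOTATION entries). The function name `read_eaf_annotations` is hypothetical, and the tier names used in HLVC files are not specified here.

```python
import xml.etree.ElementTree as ET


def read_eaf_annotations(xml_text):
    """Return (tier_id, start_ms, end_ms, text) tuples from EAF XML text."""
    root = ET.fromstring(xml_text)

    # Map each time-slot ID to its offset in milliseconds.
    # Unaligned slots carry no TIME_VALUE and are stored as None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        # Only time-aligned annotations are collected in this sketch.
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier_id, start, end, text))
    return rows
```

To use it on a file, read the file's contents as UTF-8 text and pass them to the function. Annotations that refer to other annotations rather than to time slots are ignored here.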
Amery, Heather and Stephen Cartwright. 1987. The First Hundred Words. Tulsa: Educational Developmental Corporation.
Audio .wav and .mp3, as well as .eaf transcription files are available for this task.
File labels have three parts:
- Speaker code of the primary participants
- Abbreviation indicating the type of interaction (preceded by an underscore):
  - _IV is the sociolinguistic interview, generally a lengthy, relaxed conversation
  - _EOQ is the oral administration of the Ethnic Orientation Questionnaire
  - _FW is a picture-description task, referred to as First Words
- An extension indicating the file type (preceded by a dot or period):
  - .eaf is an ELAN annotation file, or transcript
  - .wav is an uncompressed audio recording
  - .mp3 is a compressed audio recording
  - Additional file types include Praat TextGrids (.TextGrid) for acoustic analysis, and .xls and .xlsx files for record-keeping of various types, e.g., the catalog (2_2_linguists.php#catalog).
Example: F1F29A_IV.eaf is the transcription of the sociolinguistic interview of speaker F1F29A.
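The three-part file label can be split mechanically. The sketch below is one illustrative way to do it and is not part of the corpus tooling; the names `parse_file_label`, `FileLabel`, and `INTERACTION_TYPES` are hypothetical, and it assumes a label contains exactly the parts described above, with the interaction abbreviation after the last underscore.

```python
from typing import NamedTuple

# Interaction abbreviations listed in the file-naming conventions.
INTERACTION_TYPES = {
    "IV": "sociolinguistic interview",
    "EOQ": "Ethnic Orientation Questionnaire",
    "FW": "First Words picture-description task",
}


class FileLabel(NamedTuple):
    speaker_code: str
    interaction: str
    extension: str


def parse_file_label(label: str) -> FileLabel:
    """Split a label such as 'F1F29A_IV.eaf' into its three parts."""
    # The extension follows the last dot.
    stem, dot, extension = label.rpartition(".")
    if not dot or not stem:
        raise ValueError(f"no file extension in {label!r}")

    # The interaction abbreviation follows the last underscore.
    speaker_code, underscore, interaction = stem.rpartition("_")
    if not underscore or not speaker_code:
        raise ValueError(f"no interaction abbreviation in {label!r}")
    if interaction not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction type {interaction!r}")

    return FileLabel(speaker_code, interaction, extension)
```

For the example above, `parse_file_label("F1F29A_IV.eaf")` yields the speaker code `F1F29A`, the interaction `IV`, and the extension `eaf`.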
Each speaker is identified by a speakercode. The speakercode consists of five parts:
- The first character identifies the heritage language of the speaker (C, F, H, I, K, P, L, R, T, U).
- The second character identifies the heritage speaker's generation (1, 2 or 3). X or 0 identifies Homeland speakers.
- The third character identifies the speaker's sex (M or F).
- The fourth and fifth characters give the speaker's age.
- The final character (A, B, C, etc.) provides a unique identifier for otherwise identically labeled speakers.
Example: F1F29A is a Faetar-speaking, first-generation, 29-year-old female. She is the first such speaker recorded.
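The speakercode layout lends itself to a simple pattern match. The following is a hedged sketch rather than an official parser: the names `parse_speaker_code` and `Speaker` are hypothetical, the language letter is kept as-is rather than mapped to a language name, and the code assumes a two-digit age and a single uppercase final letter, as described above.

```python
import re
from dataclasses import dataclass

# One group per part of the speakercode, in order.
SPEAKER_CODE = re.compile(
    r"(?P<language>[CFHIKPLRTU])"   # heritage language letter
    r"(?P<generation>[123X0])"      # generation; X or 0 marks Homeland
    r"(?P<sex>[MF])"                # sex
    r"(?P<age>\d{2})"               # two-digit age
    r"(?P<suffix>[A-Z])"            # distinguishes otherwise identical codes
)


@dataclass(frozen=True)
class Speaker:
    language: str
    generation: str
    is_homeland: bool
    sex: str
    age: int
    suffix: str


def parse_speaker_code(code: str) -> Speaker:
    """Parse a speakercode such as 'F1F29A' into its five parts."""
    match = SPEAKER_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid speaker code: {code!r}")
    generation = match["generation"]
    return Speaker(
        language=match["language"],
        generation=generation,
        is_homeland=generation in ("X", "0"),
        sex=match["sex"],
        age=int(match["age"]),
        suffix=match["suffix"],
    )
```

Applied to `F1F29A`, this gives language `F`, generation `1`, sex `F`, age 29, and suffix `A`, matching the example above.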