Abstract
Developing a linguistic foundation for corpus construction poses several difficulties that can affect the quality and usability of the final dataset. One of the main problems is ambiguity in defining the corpus's scope and purpose, which can lead to the selection of texts that do not align with the corpus's intended use. The resulting corpus may not be representative enough to capture the variety of language use across different groups and contexts.
Another difficulty is data collection, especially where limited accessibility and copyright restrictions narrow the range of texts that can be included. In addition, sampling bias may result if the corpus concentrates unduly on particular genres or linguistic variants while neglecting others.