GLM for BRCA

The dataset comprises labeled genetic variants of the BRCA1 and BRCA2 genes obtained from the ClinVar database. Annotations were first filtered for quality and completeness of the SPDI and pathogenicity fields. Mutated DNA sequences were then generated from the GRCh38 reference genome by introducing each variant into a segment (300 bp or 500 bp) at its annotated location, and each sequence was labeled 0 (benign) or 1 (pathogenic) according to its annotation. Only sequences with high-quality review statuses (e.g., expert panel, multiple reviewers) were selected for fine-tuning the DNABERT2 pre-trained model. To handle class imbalance, extra benign sequences were extracted from the wild-type reference using a sliding-window method. The selected dataset contained 4,213 sequences for BRCA1 and 5,996 for BRCA2, which balancing brought to 5,397 and 6,701 samples, respectively.
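The sequence-generation steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy reference string, and the window and step sizes are all assumptions made for illustration. It also assumes the SPDI position indexes directly into the supplied reference string, whereas real ClinVar SPDI positions refer to GRCh38 chromosome coordinates and would need an offset.

```python
from typing import Iterator, NamedTuple


class Spdi(NamedTuple):
    """One variant in SPDI notation: sequence:position:deletion:insertion."""
    seq_id: str
    position: int  # 0-based position of the first deleted base
    deleted: str
    inserted: str


def parse_spdi(text: str) -> Spdi:
    """Split an SPDI string into its four fields (hypothetical helper)."""
    seq_id, position, deleted, inserted = text.split(":")
    return Spdi(seq_id, int(position), deleted, inserted)


def mutated_window(reference: str, variant: Spdi, window: int = 300) -> str:
    """Apply the variant to the reference and return a window around it."""
    start = variant.position
    end = start + len(variant.deleted)
    # Guard against variants that do not match the supplied reference.
    if reference[start:end] != variant.deleted:
        raise ValueError("deleted allele does not match the reference")
    mutated = reference[:start] + variant.inserted + reference[end:]
    # Place the variant roughly in the middle of the window (an assumption).
    left = max(0, start - window // 2)
    return mutated[left:left + window]


def sliding_windows(reference: str, window: int = 300, step: int = 150) -> Iterator[str]:
    """Yield wild-type windows, usable as extra benign (label 0) samples."""
    for start in range(0, len(reference) - window + 1, step):
        yield reference[start:start + window]


if __name__ == "__main__":
    toy_reference = "ACGT" * 5          # stand-in for a GRCh38 region
    variant = parse_spdi("toy:8:A:G")   # hypothetical variant, not from ClinVar
    print(mutated_window(toy_reference, variant, window=8))
    print(list(sliding_windows(toy_reference, window=8, step=4)))
```

In this sketch each mutated window would carry the label of its variant (0 or 1), while every window produced by `sliding_windows` would be labeled 0.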

Personal Data Attributes

Description: Personal Data related Information

Field: Value
Anonymised: Anonymized
ChildrenData: No
General Data: Yes
Personal Data: No
Personal data was manifestly made public by the data subject: No
Sensitive Data: No
Additional Info
Field: Value
Accessibility: Both
Basic rights: Download
Creation Date: 2025-01-15 17:00
Creator: Masci, Leonardo
Data sharing agreement: yes
Dataset Citation: Leonardo Masci, GLM for BRCA Dataset
Field/Scope of use: Any use
Group: Health Studies
License term: 2025-01-29 17:40/2030-01-29 17:40
Processing Degree: Primary
SoBigData Node: SoBigData EU
SoBigData Node: SoBigData IT
Sublicense rights: No
Territory of use: World Wide
Thematic Cluster: Other
system:type: Dataset
Management Info
Field: Value
Author: d'Aloisio Giordano
Maintainer: Masci Leonardo
Version: 1
Last Updated: 1 March 2025, 21:07 (CET)
Created: 1 March 2025, 21:07 (CET)