Results
Common Polish Y-DNA Clades
28 Mar 2015
Peter Gwozdz
28 Mar 2015: Minor edits throughout.
26 Mar 2015: Discussion topics finished. This new web page is finished for now - until
the next major update - except for possible edits - let me know if you have an
edit suggestion.
20 Mar 2015: New web page, using data downloaded from the
Polish Project 21 Jan 2015. This is a full update companion to Polish Clades.
Polish Y-DNA Haplogroup Summary Table
Edited 28 Mar 2015:
This table displays haplogroups that are most common in
Poland.
More detailed discussion is below
the table. There are statistical issues and bias
issues regarding tables such as these, so the percent frequencies should be
viewed as approximate; percent is given
to one decimal place in order to clearly communicate relative frequencies in
the database analysis.
For full details of the data and
analysis, see the file ResultsTableNew.xls.
For a more basic discussion about
what all this means, please see the Abstract
in the main web page.
Haplogroup | % | Haplogroup | % | Haplogroup | % | Haplogroup | % | Haplogroup | % |
Others | 37.7% | CDEFGH | 7.8% | C | 0.3% | | | | |
| | | | E | 4.9% | E1b1b1a1b1a (V13) | 2.6% | | |
| | | | FGH | 2.6% | | | | |
| | I J | 22.9% | I1a (DF29) | 6.4% | I1a1 (CTS6364) | 3.2% | | |
| | | | I2 (P215) | 9.4% | I2a1 (P37) | 7.2% | S17250 | 4.0% |
| | | | | | | | S4460 | 1.2% |
| | | | | | I2a2 (M223) | 1.7% | | |
| | | | J (M304) | 7.2% | J1 (M267) | 2.5% | | |
| | | | | | J2 (M172) | 4.7% | | |
| | LTNOQ | 7.0% | N (M231) | 5.5% | N1c1a1a1a (L550) | 4.9% | Z16981 | 2.6% |
R1a1 (M459) | 50.3% | R1a1a1b1a1 (M458) | 22.6% | R1a1a1b1a1a (L260) P type | 13.2% | R1a1a1b1a1a1a (YP254) | 10.0% | Y2905 | 3.3% |
| | | | | | | | Y4135 | 2.0% |
| | | | | | | | Y414 | 4.7% |
| | | | R1a1a1b1a1b (CTS11962) N type | 9.0% | R1a1a1b1a1b1 (L1029) | 6.3% | YP593 | 2.9% |
| | | | | | YP515 | 2.7% | | |
| | R1a1a1b1a2 (Z280) | 24.7% | R1a1a1b1a2a (Z92) | 4.8% | | | | |
| | | | R1a1a1b1a2b (CTS1211) | 19.2% | R1a1a1b1a2b1 (YP340) | 2.1% | D type (Y2613) | 1.8% |
| | | | | | R1a1a1b1a2b3 (CTS3402) | 17.2% | I type (S18681) | 3.1% |
| | | | | | | | B type (Y2902) | 3.1% |
| | | | | | | | J type (YP237) | 3.0% |
| | | | | | | | G type (L365) | 3.8% |
| | R1a1a1b1a3 (Z93) | 2.2% | | | | | | |
R1b1a2 (M269) | 11.8% | R1b1a2a1a1 (U106) | 5.1% | R1b1a2a1a1c2b (L48) | 2.9% | R1b1a2a1a1c2b1 (L47) | 1.7% | | |
| | | | | | R1b1a2a1a1c2b2 (Z9) | 1.2% | | |
| | R1b1a2a1a2 (P312) | 3.3% | DF27 | 1.3% | | | | |
| | | | U152 | 1.8% | | | | |
| | R1b1a2a2 (Z2103) | 2.0% | | | | | | |
R2 | 0.2% | | | | | | | | |
Total | 100% | | | | | | | | |
Discussion of the Haplogroup Summary Table
Edited 28 Mar 2015:
The table shows the percent of samples (men) for each haplogroup. The total database consists of those samples in the Polish Project that indicate
“Poland” as the earliest known male line ancestor’s origin.
There are issues involved with “Poland”
because the borders of Poland have changed often, and Poland as a country did
not exist for more than a century; I
discuss some of these issues in the two “Poland”
topics below.
There are statistical issues
involved with percentages calculated from such a database; I discuss some of these issues in the Statistical topic below.
The old 2013 version of this table
is still available at Results.html. That old version is based on the entire
Polish Project, taken as representative of historical
Poland. You can compare the
different percent results for that old version to this new version, which is
restricted to “Poland”.
This new version also has many new
haplogroups, corresponding to new SNPs. There has been a flood of new SNPs over the
past several months, so this new version has more detailed categories. Watch this page for future updates.
For more details, my analysis file ResultsTableNew.xls
is also available on-line.
I tried a few versions of different
browsers to view the Table above.
Almost all browsers display the table as I intend, but there are
exceptions with minor problems. Sorry
if your browser has display problems, for example by including lines within the
cells, where I intend each cell to represent a haplogroup, solid color,
bordered by lines in a rectangle. The
cells to the right are subdivision haplogroups. Like a tree, but with branches going right.
Very small haplogroups, and residual
paragroups, are white, not displayed
as cells, no percent displayed.
The height of each haplogroup cell
is proportional to the percent. The
full table is 10 inches, so 1% is a tenth of an inch. 1.2% is my minimum cell for display. However, I include smaller haplogroups and paragroups as rows
with short height without text. For the
actual %, please see ResultsTableNew.xls,
which has the complete list of haplogroups and paragroups, with percent, and
with all percents in each column adding to 100%.
The font of the table can be small
on cell phones, but you can use your fingers to expand to read, and then move
around the table. On desktops, change
the Zoom setting as needed to view the small font cells.
The left side of the table is based
on all samples, including those with only 12 markers, so the statistics are
better on the left. The far right of
the table is based proportionally on only those samples for which the
appropriate SNPs have been tested, because those haplogroups cannot be
predicted by STRs, so the right side is
based on fewer samples, which means lower accuracy. Please consider the far right column as estimates based on few
samples; the accuracy will improve in
the near future as more samples are tested for these new SNPs. The general central region of the table is
intermediate, based proportionally on all samples with 67 STR markers, for
those haplogroups that can be confidently predicted based on 67 STRs. For more details, please see the following
topic.
The ISOGG (http://isogg.org/tree/) codes in the table
are based on the ISOGG tree in late Feb / early Mar, 2015; the ISOGG codes change as new SNPs are
discovered. The far right column in the
table does not have ISOGG codes because those change quickly and some of those
are not yet in the ISOGG tree. The
Yfull tree (http://www.yfull.com/tree/)
was used for the right column.
Details of Analysis File
Edited 28 Mar 2015:
Please refer to my analysis file ResultsTableNew.xls.
The last sheet of that analysis
file, far right, is the “Summary STR 67” sheet. This sheet provides the data for the Haplogroup Summary Table at
the top of this web page. This sheet
includes the labels and % for all cells, including the cells that are white
(blank, no text) in the Haplogroup Summary Table.
The database is edited
for Family Sets as mentioned below with a link to
more explanation. That analysis file
has the full download in the first sheet, far left, “Project”, which has all
3,102 samples from the Polish Project with STR data on 21 Jan 2015. The edit code is column K. The edit data (for removal) are the 110
samples sorted to the top. The second
sheet, “ISOGG Sequence Full” has the remaining 2,992 samples; that sheet was not used for the table; it is for comparison to the following
sheets.
The “Project” sheet has unedited
samples with “Poland” origin in column G sorted to the bottom. These 1,431 samples, the database for the
table, are copied to the third sheet, “Poland Edit”, for analysis of categories; there are 259 assignment categories. For each category, the first sample,
along with the number of samples in that category, is copied to the next sheet,
“Poland ISOGG Sequence STR 12”, for analysis.
Categories are grouped together into large haplogroups. The next sheet “Summary STR 12” has the data
summaries. This is for the left side of
the Summary Table at the top of this web page.
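The filtering steps described above can be sketched in a few lines of Python. This is only an illustration with made-up rows, not the actual spreadsheet procedure: the field names `edit_code`, `origin`, and `assignment` are hypothetical stand-ins for columns K, G, and N of the “Project” sheet, and the sample rows are invented.

```python
from collections import Counter

# Hypothetical rows standing in for the "Project" sheet:
#   edit_code  ~ column K ("E" marks a sample removed as part of a family set)
#   origin     ~ column G (stated origin of the earliest known male line ancestor)
#   assignment ~ column N (haplogroup assignment category)
project = [
    {"edit_code": "E", "origin": "Poland", "assignment": "R1a L260"},
    {"edit_code": "", "origin": "Poland", "assignment": "R1a L260"},
    {"edit_code": "", "origin": "Poland", "assignment": "I1 DF29"},
    {"edit_code": "", "origin": "Ukraine", "assignment": "R1a Z92"},
    {"edit_code": "", "origin": "Poland", "assignment": "R1a L260"},
]


def poland_database(rows):
    """Drop edited ("E") samples, then keep only samples with "Poland" origin."""
    kept = [r for r in rows if r["edit_code"] != "E"]
    return [r for r in kept if r["origin"] == "Poland"]


def category_counts(rows):
    """Count the number of samples in each assignment category."""
    return Counter(r["assignment"] for r in rows)


poland = poland_database(project)
counts = category_counts(poland)
total = len(poland)
for category, n in counts.most_common():
    print(f"{category}: {n} samples, {100 * n / total:.1f}%")
```

With the real data, the same two filters would reduce the 3,102 downloaded samples to 2,992 after editing and then to the 1,431 “Poland” samples.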
Those are the haplogroups that can
be confidently divided into a tree
of haplogroups based on the minimum
12 STR markers. Some (not all) of
those can be confidently divided into branching haplogroups using 67 STR
markers. For these cases, all the
samples with fewer than 67 markers need to be removed
from the database, then the percent of the haplogroup (the 12 marker haplogroup
being subdivided) can be proportioned to the branches according to the percent
of 67 (or more) marker samples in each branch.
The sheet “Edit 67” has the database of samples with 67 or more markers,
extracted from the “Project” sequence, where the “Poland” samples are sorted by
number of markers, so the ones required are at the bottom; there are 763 such samples, in 235
categories. This “Edit 67” sheet
arranges the samples for extraction to the following sheet, “Category Sequence
67”, extracting only the first sample from each category along with the number
of samples in that category.
The sheets to the right named “I1”,
“R1a” and “R1b” have corresponding data copied from “Category Sequence 67” as
labeled in cell B2: “Data from the Category Sequence 67 sheet.” Each of these has at least one sample that
cannot be predicted based on 67 markers;
such samples are identified in the sheets and left out of the
totals. Leaving these out of the totals
is equivalent to assigning them to the various 67 marker categories,
proportionally to the number of samples in each category, because in each sheet
the % weight for each sample (see cell D4 in each sheet) is calculated to make
the total percent for that sheet the same as the corresponding parent
haplogroup (based on 12 marker data).
There is a statistical issue associated with this proportional
assignment, discussed below in the Statistical
topic.
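The claim that leaving samples out is equivalent to assigning them proportionally can be checked with a short sketch. The counts and the 12.0% parent figure below are hypothetical, chosen only for illustration, and the function name `branch_percents` is not from the analysis file; the per-sample weight is one reading of what cell D4 holds.

```python
def branch_percents(parent_percent, branch_counts):
    """Split a parent haplogroup's percent among its branches in
    proportion to the number of assignable samples in each branch."""
    assigned = sum(branch_counts.values())
    weight = parent_percent / assigned  # % weight per sample (compare cell D4)
    return {name: n * weight for name, n in branch_counts.items()}


# Hypothetical parent haplogroup at 12.0% of the database, with 60 samples
# at 67 markers, of which 10 cannot be assigned to any branch.
counts = {"branch A": 30, "branch B": 15, "branch C": 5}
unassigned = 10

# Method 1: simply drop the unassigned samples.
dropped = branch_percents(12.0, counts)

# Method 2: hand the unassigned samples to the branches proportionally
# (fractional samples allowed), then split the parent percent.
assigned = sum(counts.values())
padded = {name: n + unassigned * n / assigned for name, n in counts.items()}
redistributed = branch_percents(12.0, padded)

for name in counts:
    print(name, round(dropped[name], 2), round(redistributed[name], 2))
```

Both methods give the same branch percents, and the branch percents add up to the parent percent.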
Similarly, haplogroups can be
further subdivided based on SNP results, restricting the data to those samples
that have been tested for the appropriate SNPs. Again, removing samples not appropriately tested is equivalent to
assigning them proportionally to the subdivision branches. Again, the statistical issue is discussed
below. Such SNP subdivision can copy
data from either sheet “Poland ISOGG Sequence STR 12” or “Category Sequence
67”. Each haplogroup so divided has a
sheet named for that haplogroup, with sheets arranged left to right according
to order in the Summary Table top to bottom.
Again, cell B2 documents where the data came from; again, D4 gives the % weight for each
sample.
The data from all sheets is
summarized in the last sheet on the right, “Summary STR 67”, which was started as
a copy of the “Summary STR 12” sheet, with the 67 marker based and SNP based
data added.
Polish Project vs “Poland”
Edited 28 Mar 2015:
The Polish Project is open to all men
who trace their male line ancestry to the region of historical Poland. “Historical” Poland was much larger than
Poland as defined by its current boundaries.
I discuss this further in the following topic. In my Fall 2009
publication I discuss reasons for considering the Polish Project as
representative of historical Poland.
Accordingly, my tables of haplogroups have in the past been based on all
the Polish Project samples with Y-DNA STR
data. This 2015 update is a
change. I now restrict the data to
samples that indicate “Poland” as the place of origin for their earliest known
male line ancestor. There are issues
with the use of “Poland”. These issues
are discussed in the next topic. This
topic provides my reasons for the change.
The basic reason: the Polish Project now has too many samples
from the countries bordering Poland.
Many if not most of these are probably not from the region of historical
Poland.
There are good reasons for allowing
men to join the Polish Project even with ancestry from regions somewhat beyond
the largest borders of historical Poland.
The Y-haplogroups for these regions are present in Poland, and more data
is helpful in the task of determining the branching of haplogroups. Discovery of the branching SNPs is one goal of the Polish Project.
Another reason: Mayka
has been doing a great job of analysis of Polish Project data, and keeping the
web page up to date. This has attracted
men from border countries to join the project, in order to learn their
haplogroup branch, and in order to see the names of their distant male line
relatives. This fills another goal of
the Polish Project. Some of the
neighboring countries do not have such well managed projects.
A big reason: Effort:
It takes a lot of effort for Mayka to determine which samples are not
from the region of historical Poland.
In most cases the men need to be contacted by email. Mayka had been analyzing such samples, and flagging them as “No Release”
so they are not displayed in the public web pages. With the growth of the Polish Project, this effort has become too
great. Samples obviously not connected
to the region of Europe near Poland are still flagged with “No Release”.
I mention above that more data is
helpful. Restriction to “Poland” means
less data, which means less statistical accuracy. This was another of my reasons for considering the full Polish
Project in the past. The Polish Project
has grown a lot over the years, however, so the restriction to “Poland” is no
longer a very serious issue.
There are other Y-DNA projects that
could be combined with the Polish Project data to produce a larger “Poland”
database. However, men with “Poland”
provided for origin who do not join the Polish Project may be relatively less
certain of their Polish male line ancestry, so the Polish Project may well serve
as a more concentrated database regarding Polish ethnicity. Ethnicity is discussed in the following
topics.
There are issues involved with using
self assignment to “Poland” as the criterion for a database. These issues are discussed in the following
topic. However, there are also issues
in using self assignment to a geographical region as the criterion, so my
recent change to “Poland” is primarily a change of issues and interpretation,
as discussed in the following topics.
Issues with “Poland”
Edited 28 Mar 2015:
For the Haplogroup Summary Table, my
database is those samples from the Polish
Project that indicate “Poland” as the earliest known male line ancestor’s
origin. There are issues with this use
of “Poland”:
Partition of Poland: “Austria” or “Russia” or “Prussia”
or “Germany” were used by many Polish immigrants in other countries (mainly USA
around 1900) in documentation of place of birth, because Poland did not exist
from 1795 to 1918. Poland was split
three ways. The Austrian sector was in
the south. The Russian sector was in
the east. The Prussian (now German)
sector was in the west. Some men, when
submitting their DNA data, correct this and enter “Poland” if they consider
their male line Polish, while others who consider their male line Polish just
copy the documentation for origin as reported.
The uncorrected data reduces the “Poland” database and thereby reduces
the accuracy of haplogroup percentages.
Data may have a bias if correction from certain sectors is more
common. For example, maybe Polish immigrants
from Russia said “Poland” as their origin less often than Polish immigrants
from Austria said “Poland”; or
vice-versa; we don’t know which way it
might bias. Insofar as some haplogroups
are more common in some sectors than others, this may introduce a bias in the
percentages of haplogroups in Poland.
Although perhaps not very large, this bias may well be larger than some
strictly random statistical issues (below).
In the time between WWI and WWII,
a large part of what is now Belarus and Ukraine was in the eastern part of
Poland. Again, some men who consider
their male line Polish may correct their documentation to “Poland”, if the
documentation says “Belarus” or “Ukraine”.
Here, there is also the opposite problem: some men from Belarus or Ukraine who do not consider their male
line Polish may enter “Poland” if their documentation so indicates. Similarly, a large part of what is now
western Poland was Germany between WWI and WWII.
Commonwealth. During the
centuries before the partition of Poland, Poland was a “Commonwealth” with an
even larger area. The Commonwealth of
Poland was in existence well before the era of nationalism; the Commonwealth was more like the United
States today; people who lived in the
Commonwealth were considered Polish because they lived in the
Commonwealth. These people were a mix
of many different nationalities (ethnicities).
Up until WWII, in the region considered “Poland”, there were many
neighborhoods of people who spoke a language other than Polish, with a
different culture; these people may or
may not have considered themselves Polish.
Similarly, there were many neighborhoods of people who spoke Polish in
countries outside the then current Polish borders. Today, because of WWII and the aftermath, Poland is not very
ethnically diverse by comparison.
Geographical vs Ethnic DNA data. In this sense, the Polish Project is more representative of
historical Poland than current Poland.
The Polish Project is not solely representative of ethnic Polish
people. Restriction to “Poland” samples
provides a more ethnic database. More
discussion about this is below.
Polish family names can be seen in DNA data with male line
origin given as USA, or England, or other countries not close to Poland. Obviously, these are mostly men who have not
found documentation for their very distant ancestors, so they state the birthplace
of their known ancestor, perhaps a father or grandfather. I considered increasing the database by
including data with Polish family names, but that is difficult to do, and that
produces a problem that I consider overwhelming: many Polish people have German or Russian sounding names even
though their ancestors were all Polish for as many generations back as they
can trace. Some of these may be male
lines that moved to Poland long ago from elsewhere, but many if not most of
these are probably male lines where the family names were Germanized or
Russianized, in the same way that many Polish names today are Anglicized in the
USA. I saw many instances of such
family name changes in 19th century Poland when studying the Polish microfilms
of vital records while researching my family tree. So I do not add Y-DNA data from other countries given as origin
based on Polish names, and I do not remove “Poland” data for names that do not
sound Polish. Such a name based study
may be an interesting experiment.
A very small percentage of samples in the
Polish Project have non-Polish names, for example from a man who knows his father or
grandfather was adopted from an orphanage, and where there is strong suspicion
that his male line is Polish, or from other
similar special situations. Although
rare, these indicate how difficult it is to restrict data to Polish origin
without the danger of excluding someone with a genuine interest in the Polish
Project.
Geographical issues such as
the previous paragraphs are not unique to Poland. The same types of issues would come up in an analysis of Y-DNA
from most countries. But these issues
are particularly applicable to Poland, because of the wide swings in the border
of the country of Poland over the centuries, and because of the large Polonia
diaspora, and because Poland was very cosmopolitan in the past.
Biskupski published a
delightful essay regarding the ambiguity of the word “Polish”. (Biskupski M, “Who is a Pole and Where is
Poland?” Rodziny - The Journal of the
Polish Genealogical Society of America (PGSA) Summer 2006:5-12 - can be
purchased through PGSA.) Biskupski
considers all these issues, and also other ambiguities that I do not cover in
this brief discussion. Biskupski does
not provide a definition of “Polish”;
in fact he avoids such a definition, preferring to let people define
themselves as Polish because “it feels right”, quoting Tuwim: “Jestem Polakiem bo tak mi się
podoba.” “I am a Pole because
that feels right to me.” Note that
I follow Biskupski’s advice in the paragraphs above with the phrase “consider
their male line Polish”. Biskupski
points out the problem with restricting “Polish” to the current borders of
Poland: with such a restriction famous
Poles such as Kosciusko, Sobieski, Paderewski, and others would not be
considered Poles. However, Biskupski
points out that if we expand the borders to historical Poland, then
Tchaikovsky, Dostoevsky, Nietzsche, and others become Poles. Clearly, borders cannot be used alone as a
supposed definition of who is a Pole.
Self assignment as origin
“Poland” can of course be a problem.
Any study based on a self assigned database will have difficulty being
published in a peer reviewed journal.
No doubt some people are wrong about their male line ancestry, or self
deluding. It may be that men uncertain
of their origin may be more likely to consider Y-DNA tests and more likely to
join the various projects, so maybe DNA projects have a higher concentration of
uncertainty than the general population.
On the other hand, men who state “Poland” as origin are surely more
likely to be of Polish ethnicity than men in the Polish Project who state
another country. Conversely, men who do
not consider themselves Polish may join the Polish Project because their male
ancestor came from the region of historical Poland, and those men are more
likely to not state “Poland” for the origin.
In this regard, my 2015 table is more ethnically Polish than my previous
tables. I have evidence for this, in
the following topic.
Data Analysis
Edited 28 Mar 2015:
I use L260 as an example; the same kind of analysis can be done for
many of the haplogroups at the top of
this page, regarding evidence of ethnic concentration.
L260 is the largest Y-DNA haplogroup
with significant concentration in Poland and with much lower concentration
elsewhere. Before L260 was discovered,
I identified this clade (hypothetical
haplogroup) with the name P type. In my Fall
2009 publication I reported the frequency of P type Y-DNA as “about 8%” in Poland. In my discussion I pointed out that 7.6% was
my calculated result for the Polish
Project, representative of historical Poland, and that among ethnic
Poles the frequency might be slightly higher, justifying my 8% estimate in the
abstract of the paper. In Fall 2010
(same journal, JOGG) I reported the discovery of L260 and showed that it is
almost equal to P type, with very few P type outliers
(samples that marginally match P type with 67 STRs
but test negative for L260), and even fewer L260+ samples that do not match P
type.
P type L260 has gained frequency in
the Polish Project over the years. The
file ResultsTableNew.xls
has the Polish Project data available in the first sheet, “Project”, where
there are 2992 samples after editing (no E in column K) of which 1757 have 67
or more markers (column H). Of these,
samples with Assignment (column N) containing “L260” number 161, so the percent
L260 is 161/1757 = 9.2%. I’m thinking
this increase from 7.6% to 9.2% might be due to an “attraction” to L260: because of the web visibility of
L260, men who match L260 may be more likely to join the Polish Project than
men who do not match any of the common Polish haplogroups. Men know their matches because their FTDNA home page displays their closest Y-DNA
matches; also, STR matching can be done
on-line. Some men might purchase DNA
tests based on chips with thousands of markers (not just Y markers), in which
L260 is usually included. I’m not sure
of this conclusion; the 1.6% increase
might be due to random statistics if the 7.6% result just happens to have a
lower proportion of L260 just by luck, while the 9.2% might have a lucky higher
proportion. Also, there are other
possible non-random biases, discussed above and below at this web page, that
might cause a bias toward a slight increase in L260.
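As a rough check on how much of such a shift random sampling alone might produce, the 9.2% figure and an approximate margin can be computed as below. The binomial standard error used here is an assumption brought in for illustration; it is not the Poisson macro mentioned elsewhere on this page, and the function name is hypothetical. The sample size behind the 2009 figure of 7.6% is not given here, so no formal comparison of the two results is attempted.

```python
import math


def percent_with_error(count, total):
    """Percent frequency and a rough one-standard-error margin (binomial)."""
    p = count / total
    se = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * se


# 161 samples assigned L260 out of 1757 samples with 67 or more markers.
pct, se = percent_with_error(161, 1757)
print(f"L260: {pct:.1f}% +/- {se:.1f}%")
```

Under this assumption the margin on the 9.2% result alone is roughly 0.7 percentage points.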
However, most biases of concern are in regard to Polish ethnicity, which we can analyze separately: In the current analysis, restricted to “Poland”, the table at the top of this web page shows L260 P type as 13.2%, significantly higher than 9.2% for the entire project. Some of that may be statistical variation, but 4% is unlikely to be just statistical variation. It seems more likely to me that 4% roughly represents the increased concentration when data is restricted to “Poland” origin. In other words, 4% seems like a rough estimate for the greater concentration in ethnic Poles compared to the full Polish Project. My “slightly higher” adjustment back in 2009, less than 1%, seems to have been too conservative. It seems P type (L260) is noticeably more concentrated among ethnic Poles, even in the region around historical Poland.
Other haplogroups, less frequent
among ethnic Poles, are coming out lower in the new table. These can be seen by comparing the current
table at the top of this web page to my 2013 table at Results.html. Examples:
Haplogroup E was 6.0% in the 2013 Polish Project, but is only 4.9% now
when restricted to “Poland”. Haplogroup
N was 8.8% in the 2013 Polish Project, but is only 5.5% now when restricted to
“Poland”. Both E and N are widely
distributed European haplogroups, which apparently do not happen to have any
large sub group concentrated in Poland.
Edited 28 Mar 2015:
I am aware that Y-DNA data can be
misinterpreted with regard to ethnicity.
There has been a lot of migration of ethnic peoples in Europe over the
centuries. Ethnic peoples have
mixed. DNA has been scrambled due to
recombination, but Y-DNA is unusual because the Y chromosome does not recombine
(except for a very small part). The Y
chromosome alone does not indicate ethnicity for individuals. Each of us has more than 1,000 ancestors
about 300 years ago. (At 10
generations, 2 to the 10th is 1,024, so each of us has 1,024 branches in our
pedigree tree for ancestors going back 10 generations. 30 years is a bit high per generation on the
average, so the number of ancestors is greater than 1,024 at 300 years.) Our Y chromosome is associated with our male
line, which is only one of those 1,000+ ancestral lines 300 years ago. Going back more than 300 years is much more
dramatic.
That said, the Y is interesting
because certain Y haplogroups are concentrated in certain geographical regions,
which is evidence of concentration in the past, and which is a hint regarding
origins and migrations. P type (L260)
is one of the more dramatic, with high concentration only in Poland, among
ethnic Poles. Obviously, the mixing of
people was not enough to homogenize the Y chromosome across the continent, or
even across the region of the Polish Commonwealth.
Edited 28 Mar 2015:
Small number of samples: When % frequency of a small haplogroup is calculated, based on a
small number of samples, there is a well-known uncertainty in the result,
expressed as a confidence range. Poisson
statistics is used for calculation. I
explain this in my Fall 2009
publication about types, and I provide
a macro for calculation of confidence range in my Excel files for types; the master file is Types.xls, with a
demonstration of the Poisson statistics macro on the far right of sheet
“SBP”. As a small sample example, the
very short cell at the top of this web page, haplogroup C (top haplogroup, not
colored), has only 4 samples, with frequency 4/1431 = 0.28%. The 70% confidence range is 0.14% to
0.51%; the 90% confidence range is
0.10% to 0.64%.
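The haplogroup C example can be reproduced with a short Python sketch. This is not the macro from Types.xls: it assumes the standard exact two-sided range for a Poisson mean (equal probability in each tail), solved here by bisection, and the function names are hypothetical. Like the macro, it treats the database size (1431) as exact.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def solve(f, lo, hi, target, steps=200):
    """Bisection for a decreasing function f on [lo, hi]: find f(x) = target."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_range(k, confidence):
    """Exact two-sided range for the Poisson mean, given k observed samples."""
    tail = (1 - confidence) / 2
    search_max = k + 10 * math.sqrt(k + 1) + 10  # generous search bound
    # Lower end: the mean at which seeing k or more has probability `tail`.
    lower = 0.0 if k == 0 else solve(
        lambda lam: poisson_cdf(k - 1, lam), 0.0, search_max, 1 - tail)
    # Upper end: the mean at which seeing k or fewer has probability `tail`.
    upper = solve(lambda lam: poisson_cdf(k, lam), 0.0, search_max, tail)
    return lower, upper


samples, database = 4, 1431
for confidence in (0.70, 0.90):
    low, high = poisson_range(samples, confidence)
    print(f"{confidence:.0%}: {100 * low / database:.2f}% to "
          f"{100 * high / database:.2f}%")
```

For 4 samples out of 1431 this gives ranges that round to 0.14% to 0.51% at 70% confidence and 0.10% to 0.64% at 90% confidence, in agreement with the values quoted above.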
Technical caveat: my macro only considers the number of
samples in the haplogroup or type, not considering the uncertainty due to the
number of samples in the database, because the latter is usually very
large. My macro is only approximate for
a frequency where both the numerator and denominator are small.
Some of the short white rows (cells
not colored, no text with percent) in that table are based on fewer than 10
samples, with wide range of confidence.
The uncertainty is not a serious issue for those short cells. The taller white cells have better confidence,
based on a paragroup combination of
small haplogroups.
An example of a colored cell with
highly uncertain result, on the far right, Y4135: With only 3 samples, the
frequency is 2%, because the database in this case is the father haplogroup
YP254 with only 15 samples that have been adequately tested for the relevant
new SNPs. YP254 is 10.0% of the full
database, and 3/15 times 10% gives 2% for Y4135.
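The Y4135 arithmetic, and its sensitivity to a single sample, can be written out as a small sketch. The function name is hypothetical; the inputs (3 of 15 tested samples, parent YP254 at 10.0%) are the figures given above, and the 2 and 4 sample cases are included only to show how far one sample moves the result.

```python
def branch_percent(parent_percent, branch_samples, tested_samples):
    """Share of the tested samples in the branch, times the parent percent."""
    return parent_percent * branch_samples / tested_samples


y4135 = branch_percent(10.0, 3, 15)
print(f"Y4135: {y4135:.1f}%")

# One sample more or fewer among the 15 tested shifts the result noticeably.
for n in (2, 3, 4):
    print(f"{n} of 15 tested samples -> {branch_percent(10.0, n, 15):.1f}%")
```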
In general, the left side of the
table is based on hundreds of samples and the center of the table is based on
dozens of samples. Uncertainties due to
number of samples are very small compared to the other statistical issues,
below, and compared to the non-random biases discussed above. Except on the right side:
The right side of the table has
cells with lower sample size. Only 5
colored cells are based on fewer than 10 samples: S17250, Y2905, Y4135, YP414, and YP593. These are preliminary results for new SNPs; the next update should have much better
statistics as more samples are SNP tested.
However, I’ll probably add another uncertain column on the right at the
next update.
See the “Summary STR 67” sheet in ResultsTableNew.xls
for a table including all sample sizes;
{braces} are used for those haplogroups and paragroups that are not
displayed here (white rows in the table above).
The reason I have not calculated and
tabulated confidence ranges: I do that
in my table with Poland
Concentration Index for the haplogroups most concentrated in Poland. That “PCI” table needs to be updated. I figured it is more important to first get
this overall table updated and on line and well documented.
Recruitment Bias; Edits for Family
Sets: My explanation of editing for Family Sets is at the main web page.
Briefly, family sets produce a bias to haplogroup frequencies. This is a non-random statistical issue that
can be reasonably corrected by editing.
The first sheet “Project” in ResultsTableNew.xls
has my edit code in column K. The 110
“E” samples are removed from data analysis, as detailed above.
Proportional
Distribution; 67 marker data; SNP data. As explained above, STR data with fewer than 67 markers are
removed from the database for subdivision of haplogroups into branches, where
possible based on 67 markers. There can
be a bias, because there are sometimes samples with 67 markers that cannot be
assigned to any of the branches, and these are also dropped. Dropping samples is equivalent to
distributing them to the branches, proportional to the number of samples
assigned to the branches. However, one
or more of the branches might be more isolated
in haplospace. In other words,
particular branches might be more predictable with STRs, having relatively
fewer marginal assignments; other
branches might overlap more based on STR assignments. Accordingly, more of the unassigned samples might really belong to
those overlapping branches. In such
cases, it might be better to weight the branch frequencies somehow. I never came up with a satisfactory
method. This issue is becoming moot,
because almost all large haplogroups defined using 67 markers now have
definitive SNPs. As samples are tested
for these SNPs my distribution based on 67 markers will no longer be
necessary; the SNP based distribution
will take over.
There is a related issue that
applies both to 67 marker distribution and SNP distribution: Very small haplogroups may be
overlooked. If a very small branch has
no samples in the edited database, it may actually have a few unidentified
samples in the larger database (samples with fewer markers or with insufficient
SNP tests). This is a well-known
statistical issue. (We divide by N-1
instead of N when calculating a standard deviation for a similar issue -
proportional neglect of outliers in a small sample size.) So the table above really should have more
short white rows for small haplogroups, although I suppose most of those might
be shorter than the thickness of the black lines.
Other Statistical Issues. There are other non-random statistical biases that need
consideration, discussed above in the topics Polish
Project vs “Poland” and Issues with “Poland”. These are the ones that cannot be easily
quantified. In my estimation, issues
such as these are usually the largest sources of uncertainty. That’s the case for any Y-DNA frequency
table, not just for Poland.
Revision History
2011 Sep 10 Initiate this document by moving table from PolishCladesl.html
2011 Four updates after Sep 10
2012 Feb 8 update the first 3 columns of the Summary Table
2012 Feb 11 finish update of the Summary Table
2012 Feb 12 update the detailed Results Table, except R section is not updated yet
2012 Feb 13 new Table of Largest Clades in Results
2012 Feb 14 Results update finished
2012 Feb 15 multiple edits
2012 Mar 8 Kx correction
2012 Mar 8 new columns for analysis file links; some links added; not all yet
2012 Mar 9 finish analysis files links
2013 Nov 6 Update of the Summary Table (full table not updated yet)
2013 Nov 8 “Largest Categories” new topic
2013 Nov 17 “Largest Terminal Haplogroups and Paragroups” new topic
2013 Nov 18 Update of Summary Table and Largest Categories
2013 Nov 19 Update of Summary Table and Largest Categories
2013 Nov 20 Update of Results Table - the long one
2013 Nov 21 Update; minor edits
2015 Mar 20 new web page “ResultsNew.html” with complete update
2015 Mar 22 a few more subdivision cells added to the table.
2015 Mar 23 more subdivision - finished table; “Details of Analysis File” finished
2015 Mar 24 more discussion added below the table
2015 Mar 25 more discussion added below the table
2015 Mar 26 discussion topics finished; this new web page is complete for now
2015 Mar 28 Minor edits throughout this page