Results
Common Polish Y-DNA Clades
26 Feb 2016
Peter Gwozdz
24 Feb 2016: Complete Update. 26 Feb 2016: minor edits
Polish Y-DNA Haplogroup Summary Table
Rewrite 20 Feb 2016.
This table displays the haplogroups that are most common in
Poland. The table is arranged as a tree
lying on its side, with branches going to the right.
The height of each branch is proportional to its percent frequency. Haplogroups smaller than 1% are not shown,
so only the largest branches, in the first column, add up to 100%.
More detailed discussion is below
the table. There are statistical issues and bias
issues regarding tables such as this, so the percent frequencies should be
viewed as approximate; percent is given
to one decimal place in order to clearly communicate relative frequencies in
the database analysis.
For full details of the data and
analysis, see the file ResultsTable.xls.
For a more basic discussion about
what all this means, please see the Abstract
in the main web page.
Haplogroup | %
C | 0.2%
CEFGH | 8.1%
E (M96) | 5.2%
E1b1b1 (M35) | 5.1%
E-V13 | 2.8%
E-Z827 | 1.9%
FGH | 2.7%
G (M89) | 2.6%
G2a (P15) | 1.5%
I1a-Y6349 I1a-P Type | 1.4%
I1a (DF29) | 6.2%
I1a-L22 | 1.5%
I1a-Z58 | 1.3%
I1a-Z63 | 1.5%
C_to_Q | 37.8%
I J | 22.8%
I2 (P215) | 9.3%
I2-CTS5966 | 6.4%
I2-S17250 | 5.1%
I2-Y4882 | 2.3%
I2-Y4460 | 1.3%
I2a2 (M223) | 1.9%
I2-L801 | 1.4%
J1-YSC234 | 1.1%
J1 (M267) | 2.43%
J1-Z18207 | 1.3%
J1-L816 | 1.1%
J (M304) | 7.3%
J2 (M172) | 4.9%
J2a (M410) | 3.18%
J2-M67 | 1.7%
J2-Z222 | 1.4%
J2-M575 | 1.7%
J2-M241 | 1.5%
LTNQ | 6.9%
N (M231) | 5.2%
N-L550 | 5.0%
N-M2783 | 4.6%
N-Z16975 | 1.9%
N-Z16981 | 1.1%
Q1 (L232) | 1.1%
R1a_Y2905 | 3.5%
R1a-YP1364 | 1.7%
R1a-Y4135 | 1.7%
R1a-YP589 | 1.4%
R1a1 (M459) | 49.5%
R1a-M458 | 23.2%
R1a_L260 P type | 14.7%
R1a-YP254 | 9.7%
R1a_YP414 | 4.2%
R1a-YP610 | 2.8%
R1a-YP654 | 2.4%
R1a-CTS11962 N type | 8.2%
R1a-L1029 | 4.7%
R1a-YP593 | 1.4%
R1a_YP515 | 3.5%
R1a-E Type | 1.2%
R1a-Z92 | 4.2%
R1a-Z685 | 2.3%
R1a-CTS4648 | 1.4%
R1a-YP343 | 2.2%
R1a-YP371 | 1.4%
Y2613 D Type | 2.0%
R1a-Y2608 | 1.5%
R1a-L269 Gb type | 1.0%
R1a-YP237 | 6.7%
R1a_YP389_G_type | 2.2%
R1a-YP977 J type | 1.9%
R1a-L1280 | 1.1%
R1a-Z280 | 23.2%
R1a-CTS1211 | 18.5%
R1a-CTS3402 | 15.9%
R1a-Y33 | 7.1%
R1a_S18681_I_type | 2.5%
R1a-Y2902 B type | 3.1%
R1a-Z93 | 2.6%
R1a-Z2122 | 1.9%
R1b-U106 | 4.5%
R1b-L48 | 3.3%
R1b-L47 | 1.6%
R1b-L47 P type | 1.2%
R1b-Z9 | 1.2%
R1b-DF27 | 1.3%
R1b-Z196 | 1.2%
R1b1a (L754) | 12.5%
R1b-M269 | 12.3%
R1b-312 | 4.5%
R1b-U152 | 2.1%
R1b-L2 | 1.7%
R2 | 0.2%
R1b-Z2103 | 2.2%
R1b_Y5587_L23EEType | 2.0%
R1b-BY593 | 1.5%
Total | 100%
Discussion of
the Haplogroup Summary Table
Rewrite 20 Feb 2016.
The table shows the percent of samples (men) in each haplogroup. The database consists of those samples in the Polish Project that indicate
“Poland” as the earliest known male line ancestor’s origin.
There are issues involved with
“Poland” because the borders of Poland have changed often, and Poland as a
country did not exist for more than a century;
I discuss some of these issues in the two “Poland” topics below.
There are statistical issues
involved with percentages calculated from such a database; I discuss some of these issues in the Statistical topic below.
The Nov 2013 version of this table
is still available at Results2013.html. That old version is based on the entire
Polish Project, taken as representative of historical
Poland. You can compare the different
percent results for that old version to this current version, which is
restricted to “Poland”. The Mar 2015
version, also restricted to “Poland”, is available for comparison of how
the % values change as data accumulates;
see Results2015.html.
This new version also has many new
haplogroups, corresponding to new SNPs. There has been a flood of new SNPs over the
past couple years, so this new version has more detailed categories. Watch this page for future updates.
For more details, my analysis file ResultsTable.xls
is also available on-line.
I tried a few versions of different
browsers to view the Table above.
Almost all browsers display the table as I intend, but there are
exceptions with minor problems. Sorry
if your browser has display problems, for example by including lines within the
cells, where I intend each cell to represent a haplogroup, solid color,
bordered by lines in a rectangle. The
cells to the right are subdivision haplogroups, like a tree with branches going right. Some common Y-DNA haplogroups are not
included because they are not common in Poland.
Very small haplogroups, and residual
paragroups, are white, not displayed
as cells, no percent displayed.
Several haplogroups were discussed
in my web pages as hypothetical clades,
called types before the corresponding SNPs
were discovered, so these type names are also included in the table. A couple of these types in the table have
not yet been validated as
SNP haplogroups.
The height of each haplogroup cell
is proportional to the percent. The
full table is 10 inches tall, so 1% is a tenth of an inch. 1.0% is my minimum cell for display. The cells are much smaller if viewed on a cell phone. However, you can view my complete listing of all categories in my
Excel analysis file, which is discussed in more detail in the following topic.
The font of the table can be small
on cell phones or tablets, but you can use your fingers to expand to read, and
then move around the table. On
desktops, change the Zoom setting as needed to view the small font cells.
The left side of the table is based
on all samples, including those with only 12 STR
markers, so the statistics are better on the left. The far right of the table is based proportionally on only those
few samples for which the appropriate SNPs have been tested, because those
haplogroups cannot be predicted by STRs, so the right side is based on fewer
samples, which means lower accuracy.
Please consider the far right column as estimates based on few
samples; the accuracy will improve in
the near future as more samples are tested for these new SNPs. The central region of the table is
intermediate, because samples with 67 STR markers can be confidently predicted
to the larger branches even without SNP data.
For more details, please see the following topics.
The ISOGG (http://isogg.org/tree/) codes for the largest
branches in the table are based on the ISOGG tree in late Jan / early Feb,
2016; the ISOGG codes change as new
SNPs are discovered. Most of the table does
not have full ISOGG codes because they are not practical for small
branches. The YFull tree (http://www.yfull.com/tree/) was used for
haplogroups not listed by ISOGG.
Rewrite 21 Feb 2016.
Please refer to my analysis file ResultsTable.xls.
The last sheet of that analysis
file, far right, is named “Results Table”. This sheet provides the data for the Haplogroup Summary Table at
the top of this web page. That file
opens to this sheet in a split format;
column B on the left provides the full list of Polish Project categories; column BF on the right provides the percent
frequency for each category. This list
includes the % for all categories, including the categories that are white
(blank, no text) in the Haplogroup Summary Table.
Column BH combines categories
smaller than 0.5%, so column BH has twice the resolution of the Haplogroup
Summary Table.
To the right of column BH you can
see how the smaller haplogroups were combined to produce the data for the six
columns of the Haplogroup Summary Table at the top of this page. In that Excel file, the percents in each
column add to 100.00% (totals in row 3).
Column C is the ISOGG sequence of
Y-DNA haplogroups corresponding to the Polish Project Categories. The Categories are arranged in ISOGG
order. Blank lines are inserted for clarity.
Columns D through W provide an
outline in tree format, listing the primary SNP corresponding to each Category.
Columns AL to BE have the math
equations for proportional
distribution. I am assuming that those
few samples with sufficient SNP data are proportionally representative of the
percent for their categories. Of course
this is not exactly reliable for a small number of samples; I discuss this further in the Statistical topic below. This proportional representation is
mathematically equivalent to proportionally assigning all the samples without
sufficient SNP data. Columns AL to BE
do that.
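The idea of proportional distribution can be sketched in a few lines of Python. This is only an illustration, not the actual equations in columns AL to BE; the function name and the dictionary layout are hypothetical. The numbers are those of the R1b-BY593 case from the Statistical topic: the parent, R1b-Y5587 L23EE Type, is listed at 2.0%, and of the 4 samples tested for BY593, 3 are positive and 1 is negative.

```python
def distribute_proportionally(parent_percent, tested_counts):
    """Split a parent's percent among subclades in proportion to SNP-tested counts.

    Hypothetical helper for illustration; assumes the tested samples are
    representative of the untested samples in the same parent category.
    """
    total_tested = sum(tested_counts.values())
    if total_tested == 0:
        raise ValueError("no SNP-tested samples to base the proportions on")
    return {
        name: parent_percent * count / total_tested
        for name, count in tested_counts.items()
    }


# Parent L23EE Type at 2.0%; 3 of 4 tested samples are BY593 positive.
shares = distribute_proportionally(2.0, {"BY593 positive": 3, "BY593 negative": 1})
for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
```

With these inputs the positive share is three quarters of 2.0%, which is the 1.5% listed for R1b-BY593 in the table.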
The first sheet on the left, “Polish
Project”, has a download of Polish Project sample data on 18 Jan 2016. There were 3434 samples on that date, but
200 requested “No Public Display”, so 3234 samples are listed in that first
sheet.
This database is edited
for Family Sets as mentioned below with a link to
more explanation. The edit code is
column K in that first sheet. There are
88 samples with “E” for edit removal, leaving 3146 samples copied to the next
sheet, Poland Edit.
The “Poland Edit” sheet has samples
with “Poland” origin sorted to the top.
The “Not Poland” samples, sorted to the bottom, almost half the project,
have quite a few with obviously Polish names, but I did not attempt to sort by
name. I discuss this further in the
following two topics.
That leaves 1604 samples for
analysis. They are copied to the
following sheet, “Category Counts” for count of samples by category. The category counts are copied to the
“Results Table” sheet. Notice that the
first sheet numbered the Polish Project samples in sequence; the number of the first sample from each
category is available in column A in the “Results Table” sheet, where the
categories are rearranged in ISOGG order;
the original alphabetical project sequence can be seen in column A.
Edit 21 Feb 2016.
In my Fall 2009 publication I discuss
reasons for considering the Polish Project as representative of historical
Poland. Accordingly, my tables of
haplogroups have in the past been based on all the Polish Project samples with
Y-DNA STR data. Since 2015, I have restricted the data to samples that indicate “Poland”
as the place of origin for their earliest known male line ancestor. There are issues with the use of
“Poland”. These issues are discussed in
the next topic. This topic provides my
reasons for the change.
The basic reason: the Polish Project now has too many samples
from the countries bordering Poland.
Many if not most of these are probably not from the region of historical
Poland.
There are good reasons for allowing
men to join the Polish Project even with ancestry from regions somewhat beyond
the largest borders of historical Poland.
The Y-haplogroups for these regions are present in Poland, and more data
is helpful in the task of determining the branching of haplogroups. Discovery of the branching SNPs is one goal of the Polish Project.
Another reason: Mayka
has been doing a great job of analysis of Polish Project data, and keeping the
web page up to date. This has attracted
men from border countries to join the project, in order to learn their
haplogroup branch, and in order to see the names of their distant male line
relatives. This fills another goal of
the Polish Project. Some of the neighboring
countries do not have such well managed projects.
A big reason is effort.
It takes a lot of effort for Mayka to determine which samples are not
from the region of historical Poland.
In most cases in the past, the men needed to be contacted by email. Mayka had been analyzing such samples, and flagging them as “No Public
Display” so they were not displayed in the public web pages. With the growth of the Polish Project, this
effort has become too great. Some
samples obviously not connected to the region of Europe near Poland are still
flagged with “No Public Display”.
I mention above that more data is
helpful. Restriction to “Poland” means
less data, which means less statistical accuracy. This was another of my reasons for considering the full Polish
Project in the past. The Polish Project
has grown a lot over the years, however, so the restriction to “Poland” is no
longer a very serious issue.
There are other Y-DNA projects that
could be combined with the Polish Project data to produce a larger “Poland”
database. However, men with “Poland”
provided for origin who do not join the Polish Project may be relatively less
certain of their Polish male line ancestry, so the Polish Project may well
serve as a more concentrated database regarding Polish ethnicity. Ethnicity is discussed in the following
topics.
There are issues involved with using
self assignment to “Poland” as the criterion for a database. These issues are discussed in the following
topic. However, there are also issues
in using self assignment to a geographical region as the criterion, so my
recent change to “Poland” is primarily a change of issues and interpretation,
as discussed in the following topics.
Edit 21 Feb 2016.
For the Haplogroup Summary Table, my
database is those samples from the Polish
Project that indicate “Poland” as the earliest known male line ancestor’s
origin. There are issues with this use
of “Poland”:
Partition of Poland. “Austria” or “Russia” or “Prussia”
or “Germany” were used by many Polish immigrants in other countries (mainly USA
around 1900) in documentation of place of birth, because Poland did not exist
from 1795 to 1918. Poland was split
three ways. The Austrian sector was in
the south. The Russian sector was in
the east. The Prussian (now German)
sector was in the west. Some men, when
submitting their DNA data, correct this and enter “Poland” if they consider
their male line Polish, while others who consider their male line Polish just
copy the documentation for origin as reported in the immigration document. The uncorrected data reduces the “Poland”
database and thereby reduces the accuracy of haplogroup percentages. Data may have a bias if correction from
certain sectors is more common. For
example, maybe Polish immigrants from Russia said “Poland” as their origin less
often than Polish immigrants from Austria said “Poland”; or vice-versa; we don’t know which way it might bias. Insofar as some haplogroups are more common in some sectors than
others, this may introduce a bias in the percentages of haplogroups in Poland.
In the time between WWI and WWII,
a large part of what is now Belarus and Ukraine was in the eastern part of
Poland. Again, some men who consider their
male line Polish may correct their documentation to “Poland”, if the
documentation says “Belarus” or “Ukraine”.
Here, there is also the opposite problem: some men from Belarus or Ukraine who do not consider their male
line Polish may enter “Poland” if their documentation so indicates. Similarly, some of what is now western
Poland was Germany in the past.
Commonwealth. During the
centuries before the partition of Poland, Poland was a “Commonwealth” with a
much larger area. The Commonwealth of
Poland was in existence well before the era of nationalism; the Commonwealth was more like the United
States today; people who lived in the
Commonwealth were considered Polish because they lived in the
Commonwealth. These people were a mix
of many different nationalities (ethnicities).
Up until WWII, in the region considered “Poland”, there were many
neighborhoods of people who spoke a language other than Polish, with a
different culture; these people may or
may not have considered themselves Polish.
Similarly, there were many neighborhoods of people who spoke Polish in
countries outside the then current Polish borders. Today, because of WWII and the aftermath, Poland is not very
ethnically diverse by comparison.
Haplogroup J, in particular with regard to Ashkenazi Jews, is an example
of a haplogroup over represented in the table above in comparison to modern
Poland, because of diaspora from outside Poland who enter “Poland” for male
line ancestry. I did not edit in this
regard; I suppose many Ashkenazim in
previous centuries considered themselves Polish, so in a sense the table above may
be more representative than Poland itself - of who considered themselves Polish
over the centuries.
Geographical vs Ethnic DNA data. In this sense, the Polish Project is more representative of
historical Poland than current Poland.
The Polish Project is not solely representative of ethnic Polish people. Restriction to “Poland” samples provides a
more ethnic database. More discussion
about this is below.
Polish family names can be seen in DNA data with male line
origin given as USA, or England, or other countries not close to Poland. Obviously, these are mostly men who have not
found documentation for their very distant ancestors, so they state the
birthplace of their known ancestor, perhaps a father or grandfather. I considered increasing the database by
including data with Polish family names, but that is difficult to do, and that
produces a problem that I consider overwhelming: many Polish people have German or Russian sounding names even
though their ancestors were all Polish for as many generations back as they
can trace. Some of these may be male
lines that moved to Poland long ago from elsewhere, but many if not most of
these are probably male lines where the family names were Germanized or
Russianized, in the same way that many Polish names today are Anglicized in the
USA. I saw many instances of such
family name changes in 19th century Poland when studying the Polish microfilms
of vital records while researching my family tree. So I do not add Y-DNA data from other countries given as origin
based on Polish names, and I do not remove “Poland” data for names that do not
sound Polish. Such a name based study
may be an interesting experiment.
A very low percentage of Polish Project
samples have non-Polish names because of special situations: for example, a man who knows his father or
grandfather was adopted from an orphanage and suspects that
his male line is Polish, or other
similar special situations. Although
rare, these indicate how difficult it would be to restrict Polish Project
membership without the danger of excluding someone with a genuine interest in
the Polish Project.
Geographical issues such as
the previous paragraphs are not unique to Poland. The same types of issues come up in an analysis of Y-DNA from
most countries. But these issues are
particularly applicable to Poland, because of the wide swings in the border of
the country of Poland over the centuries, and because of the large Polonia
diaspora, and because Poland was very cosmopolitan in the past.
Biskupski published a
delightful essay regarding the ambiguity of the word “Polish”. (Biskupski M, “Who is a Pole and Where is
Poland?” Rodziny - The Journal of the
Polish Genealogical Society of America (PGSA) Summer 2006:5-12 - can be
purchased through PGSA.) Biskupski
considers all these issues, and also other ambiguities that I do not cover in
this brief discussion. Biskupski does
not provide a definition of “Polish”;
in fact he avoids such a definition, preferring to let people define
themselves as Polish because “it feels right”, quoting Tuwim: “Jestem Polakiem bo tak mi się
podoba.” “I am a Pole because
that feels right to me.” Note that
I follow Biskupski’s advice in the paragraphs above with the phrase “consider
their male line Polish”. Biskupski
points out the problem with restricting “Polish” to the current borders of
Poland: with such a restriction famous
Poles such as Kosciusko, Sobieski, Paderewski, and others would not be
considered Poles. Biskupski also points
out that if we expand the borders to historical Poland, then Tchaikovsky,
Dostoevsky, Nietzsche, and others become Poles.
Clearly, borders cannot be used alone as a supposed definition of who is
a Pole.
Self assignment as male
ancestral origin “Poland” can of course be a problem for any statistical
study. No doubt some people are wrong
about their male line ancestry, or self deluding. It may be that men uncertain of their origin may be more likely
to consider Y-DNA tests and more likely to join the various projects, so maybe
DNA projects have a higher concentration of uncertainty than the general
population. On the other hand, men who
state “Poland” as origin are surely more likely to be of Polish ethnicity than
men in the Polish Project who state another country. Conversely, men who do not consider themselves Polish may join
the Polish Project because their male ancestor came from the region of
historical Poland, and those men are more likely to not state “Poland” for the
origin. In this regard, my tables since
2015 are more ethnically Polish than my previous tables. I have evidence for this, in the following
topic.
Ethnicity. Some
sociologists deny the genetic validity of the very concept of ethnicity because
of all the migrations and mixing over the millennia. I consider my analysis of Polish Project data to be my
contribution to the measure of the degree of genetic validity of “Poland” as an
ethnic term.
Accuracy and uncertainty. It seems to me all these issues about the meaning of a self
assigned “Poland” ancestral origin may provide the largest uncertainty in the
percentage frequency results reported here.
Certainly in the largest branches (to the left in the Haplogroup Summary
Table), where the number of samples is large, and where statistical accuracy is
best, the ambiguity of “Poland origin” is the largest uncertainty of the
results. I have no method to quantify
this large uncertainty. On the far
right, with branches as small as 1%, statistical uncertainty surely dominates,
which is quantified below.
Justification. Despite all
the caveats discussed in this topic, the Polish Project Y-DNA “Poland origin”
data is the best database I know of for the analysis of Y-DNA haplogroup
frequencies for Poland.
Data Analysis
Edit 22 Feb 2016.
I use L260 as an example; the same kind of analysis can be done for
many of the haplogroups at the top
of this page, regarding evidence of ethnic concentration.
L260 is the largest Y-DNA haplogroup
with significant concentration in Poland and with much lower concentration
elsewhere. Before L260 was discovered,
I identified this clade (hypothetical
haplogroup) with the name P type. In my Fall
2009 publication I reported the frequency of P type Y-DNA as “about 8%” in
Poland. In Fall 2010 (same journal,
JOGG) I reported the discovery of L260 and showed that it seems equivalent to P type, with
very few P type outliers (samples that
marginally match P type with 67 STRs but
test negative for L260), and even fewer L260+ samples that do not match P
type.
In the table above, L260 P type is
much higher, at 14.7%. This is
primarily because the data is restricted to “Poland” origin samples. That previous 8% was using the full Polish
Project, representative of historical Poland.
In other words, L260 is significantly concentrated in ethnic Poles, even
within the geographical region of historical Poland.
Other haplogroups, less frequent
among ethnic Poles, are coming out lower in the new table. These can be seen by comparing the current
table at the top of this web page to my 2013 table at Results2013.html. Examples:
Haplogroup E was 6.0% in the 2013 Polish Project, but is 5.2% now when
restricted to “Poland”. Haplogroup N
was 8.8% in the 2013 Polish Project, but is only 5.2% now when restricted to
“Poland”. Both E and N are widely
distributed European haplogroups, which apparently do not happen to have any
large sub group concentrated in Poland.
In the near future we may have enough data to identify very small E and
N branches that are concentrated in Poland, although it will be very difficult
to associate these with specific migrations long ago.
Edit 28 Mar 2015:
I am aware that Y-DNA data can be
misinterpreted with regard to ethnicity.
There has been a lot of migration of ethnic peoples in Europe over the
centuries. Ethnic peoples have
mixed. DNA has been scrambled due to
recombination, but Y-DNA is unusual because the Y chromosome does not recombine
(except for a very small part). The Y
chromosome alone does not indicate ethnicity for individuals. Each of us has more than 1,000 ancestors
about 300 years ago. At 10 generations,
2 to the 10th is 1,024, so each of us has 1,024 branches in our pedigree tree
for ancestors going back 10 generations.
30 years is a bit high per generation on the average, so the number of
ancestors is greater than 1,024 at 300 years.
Our Y chromosome is associated with our male line, which is only one of
those 1,000+ ancestral lines 300 years ago.
Going back more than 300 years is much more dramatic.
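The arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the 25 years per generation is only an assumed alternative to the 30 years used in the text.

```python
def pedigree_lines(years_back, years_per_generation):
    """Return (generations, number of ancestral lines) for a span of years.

    Hypothetical helper: each generation doubles the number of branches
    in the pedigree tree, so the count is 2 to the power of generations.
    """
    generations = years_back // years_per_generation
    return generations, 2 ** generations


# 30 years per generation is the figure used in the text;
# 25 years is an assumed shorter generation, for comparison only.
for years_per_generation in (30, 25):
    generations, lines = pedigree_lines(300, years_per_generation)
    print(f"{years_per_generation} years per generation: "
          f"{generations} generations, {lines:,} lines")
```

The male line is exactly one of these lines, whatever generation length is assumed.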
That said, the Y is interesting
because certain Y haplogroups are concentrated in certain geographical regions,
which is evidence of concentration in the past, and which is a hint regarding
origins and migrations. P type (L260)
is one of the more dramatic, with high concentration only in Poland, among
ethnic Poles. Obviously, the mixing of
people was not enough to homogenize the Y chromosome across the continent, or
even across the region of the Polish Commonwealth.
Rewrite 23 Feb 2016.
Small number of samples: When % frequency of a small haplogroup is calculated, based on a
small number of samples, there is a well-known uncertainty in the result,
expressed as a confidence range. Poisson
statistics is used for calculation. I explain
this in my Fall 2009 publication
about types, and I provide a macro for
calculation of confidence range in my Excel files for types; the master file is Type.xls, with a
demonstration of the Poisson statistics macro on the far right of sheet
“SBP”. As a small sample example, the
very short cell at the top of this web page, haplogroup C (top haplogroup, not
colored), has only 4 samples, with frequency 4/1604 = 0.249%. The 70% confidence range is 0.13% to
0.45%; the 90% confidence range is
0.09% to 0.57%. In other words, future
data for haplogroup C may well come out noticeably lower or higher than
0.249%; we are not confident of the
exact frequency in the large worldwide population of men with Polish male
ancestry who might join the Polish Project.
Technical caveat: my macro only considers the number of
samples in the haplogroup or type, not considering the uncertainty due to the
number of samples in the database, because the latter is very large in this
case. My macro understates the
confidence range for a frequency where both the numerator and denominator are
small.
In general, the left side of the
table is based on hundreds of samples and the center of the table is based on
dozens of samples. Uncertainties due to
number of samples are very small compared to the other statistical issues, below,
and compared to the non-random biases discussed above in the “Poland” topics.
On the right side the confidence range
is wider. For example, 1.0% corresponds
to 16 samples, for which the 70% confidence range is
0.74% to 1.33%.
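For readers without Excel, here is a minimal Python sketch of a Poisson confidence range. It is not the macro from Type.xls. It assumes the standard exact Poisson limits with equal tails, found by bisection, and, like the macro, it treats the database size (1604) as exact. The function names are hypothetical. Under these assumptions it should come close to the ranges quoted above for 4 samples and for 16 samples.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def _solve_decreasing(func, target, low, high):
    """Bisection for func(x) == target, where func decreases as x grows."""
    for _ in range(200):
        mid = (low + high) / 2
        if func(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def poisson_range(count, total, confidence):
    """Equal-tailed exact Poisson range for the frequency count/total.

    Only the uncertainty in count is considered; total is taken as exact.
    """
    tail = (1 - confidence) / 2
    ceiling = count + 10 * math.sqrt(count + 1) + 10  # generous search limit
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: mean at which seeing `count` or more has probability `tail`.
        lower = _solve_decreasing(
            lambda m: poisson_cdf(count - 1, m), 1 - tail, 0.0, ceiling)
    # Upper limit: mean at which seeing `count` or fewer has probability `tail`.
    upper = _solve_decreasing(lambda m: poisson_cdf(count, m), tail, 0.0, ceiling)
    return lower / total, upper / total


for count, confidence in ((4, 0.70), (4, 0.90), (16, 0.70)):
    low, high = poisson_range(count, 1604, confidence)
    print(f"{count} of 1604, {confidence:.0%} range: {low:.2%} to {high:.2%}")
```

Bisection is used here only to keep the sketch free of external libraries; a statistics package with a chi-square quantile function would give the same limits more directly.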
Some of the short white rows (cells
not colored, no text with percent) in that table are based on fewer than 10
samples, with wide confidence range.
The taller white cells have better confidence, based on a paragroup combination of small
haplogroups.
The uncertainty for haplogroups at less
than 1% is the main reason I restricted the Haplogroup Summary Table to
haplogroups with at least 1%. However,
there is a complication even for some haplogroups at greater than 1%, due to proportional distribution. As an extreme example, consider the
haplogroup R1b-BY593, listed at 1.5%, the bottom entry in the table. 1.5% normally represents 24 samples in the
database. However, R1b-BY593 has only 3
samples, so it may seem the confidence range should be wider than the example
above (haplogroup C with 4 samples).
R1b-BY593 was proportionally adjusted:
The “father” of BY593, L23EE Type, has been confidently predicted for
years, based on STRs. L23EE Type now seems equivalent to R1b-Y5587. R1b-Y5587 L23EE Type is listed in the table
at 2.0%, with relatively high confidence.
L23EE Type has 22 samples, but only 4 of them have been tested for the
new SNP BY593; 3 are positive and one
negative. On this basis, we can predict
that about 3/4 of L23EE should turn out to be BY593, but the confidence range
is quite wide; the binomial
distribution should be used, 3 samples out of 4; I won’t calculate it here but it’s obviously wider than the 4
sample example above. All the words of
these past few sentences are inherent in the equations in the above described
Excel spread sheet that proportionally distribute the L23EE samples. In addition, L23EE itself had a proportional
distribution, of the R1b-M269 samples with insufficient data for prediction.
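The binomial range for 3 positive samples out of 4 tested can be sketched in Python as follows. This calculation is not part of my Excel file. It assumes the standard exact binomial limits with equal tails (often called Clopper-Pearson), and the function names are hypothetical.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for n trials with success probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target):
    """Bisection on p in [0, 1] for func(p) == target, func decreasing in p."""
    low, high = 0.0, 1.0
    for _ in range(100):
        mid = (low + high) / 2
        if func(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def binomial_range(successes, trials, confidence):
    """Equal-tailed exact binomial range for the true proportion."""
    tail = (1 - confidence) / 2
    if successes == 0:
        lower = 0.0
    else:
        # Lower limit: p at which `successes` or more has probability `tail`.
        lower = _solve_decreasing(
            lambda p: binomial_cdf(successes - 1, trials, p), 1 - tail)
    if successes == trials:
        upper = 1.0
    else:
        # Upper limit: p at which `successes` or fewer has probability `tail`.
        upper = _solve_decreasing(
            lambda p: binomial_cdf(successes, trials, p), tail)
    return lower, upper


low, high = binomial_range(3, 4, 0.70)
print(f"3 of 4 positive, 70% range: {low:.0%} to {high:.0%} of L23EE Type")
```

Under these assumptions the 70% range for the BY593 share of L23EE Type runs from roughly 37% to 96%, which is indeed much wider in relative terms than the 4 sample Poisson example for haplogroup C.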
That L23EE example brings up another
fine point: There are 65 M269
Unassigned samples that were proportionally distributed. Most of those have fewer than 67 STRs. However, some of those do have 67 or 111
STR, and those cannot be L23EE or any of the other confidently predicted
clades. Technically, I should have
proportioned those with 67 or more STRs into the clades that cannot be
predicted, perhaps into hypothetical small clades that have not yet been
discovered. This seems like too much
effort for marginal improvement, so I did not do it. This paragraph alerts the reader to yet another source of
uncertainty in the calculated frequencies in the table, taken as a prediction
of future data, and as a measure of frequencies in ethnic Poland. Also, the non-random systematic
uncertainties discussed above, due to what is meant by “ethnic Poland”, and due
to other issues mentioned in this web page, are probably larger than the statistical
uncertainties for most clades.
I have not calculated and tabulated
all confidence ranges here: I do that
in my table with Poland
Concentration Index for the haplogroups most concentrated in Poland. That “PCI” table needs to be updated.
Recruitment Bias; Edits for Family
Sets: My explanation of editing for Family Sets is at the main web page.
Briefly, family sets produce a recruitment bias to haplogroup frequencies. This is a non-random statistical issue that
can be reasonably corrected by editing.
The first sheet “Polish Project” in ResultsTable.xls
has my edit code in column K. The
“E” samples are removed from data analysis, as detailed above. However, a small unknown accuracy issue
remains insofar as the edit for family sets is not perfect, as explained at the
main web page.
Revision History
2011
Sep 10 Initiate this document by moving table from PolishCladesl.html
2011
Four updates after Sep 10
2012
Feb 8 update the first 3 columns of the Summary Table
2012
Feb 11 finish update of the Summary Table
2012
Feb 12 update the detailed Results Table, except R section is not updated yet
2012
Feb 13 new Table of Largest Clades in Results
2012
Feb 14 Results update finished
2012
Feb 15 multiple edits
2012
Mar 8 Kx correction
2012
Mar 8 new columns for analysis file links;
some links added; not all yet
2012
Mar 9 finish analysis files links
2013
Nov 6 Update of the Summary Table (full table not updated yet)
2013
Nov 8 “Largest Categories” new topic
2013
Nov 17 “Largest Terminal Haplogroups and Paragroups” new topic
2013
Nov 18 Update of Summary Table and Largest Categories
2013
Nov 19 Update of Summary Table and Largest Categories
2013
Nov 20 Update of Results Table - the long one
2013
Nov 21 Update; minor edits
2015
Mar 20 new web page “ResultsNew.html” with complete update
2015
Mar 22 a few more subdivision cells added to the table.
2015
Mar 23 more subdivision - finished table;
“Details of Analysis File” finished
2015
Mar 24 more discussion added below the table
2015
Mar 25 more discussion added below the table
2015
Mar 26 discussion topics finished; this
new web page is complete for now
2015
Mar 28 Minor edits throughout this page
2016
Feb 24 Complete Update
2016
Feb 25 Minor edits of table