We consider the p-Norm Hamming Centroid problem which asks to determine whether some given binary strings have a centroid with a bound on the p-norm of its Hamming distances to the strings. Specifically, given a set of strings S and a real k, we consider the problem of determining whether there exists a string s∗ with (∑s∈Sdp(s∗,s))1/p≤k, where d(,) denotes the Hamming distance metric. This problem has important applications in data clustering, and is a generalization of the well-known polynomial-time solvable Consensus String (p=1) problem, as well as the NP-hard Closest String (p=∞) problem.
Our main result are
-
p-Norm Hamming Centroid is NP-hard for all fixed rational p>1, closing the gap for all rational values of $p$ between $1$ and $\infty$.
-
Under standard complexity assumptions the reduction also implies that p-Norm Hamming Centroid has no 2o(n+m)-time or 2o(kp(p+1))-time algorithm, where m denotes the number of input strings and $n$ denotes the length of each string, for any fixed p>1. Both running time lower bounds are tight. In particular, we provide a 2kp(p+1)+ε-time algorithm for each fixed ε>0.
-
In the last part of the paper, we complement our hardness result by presenting a fixed-parameter algorithm and a factor-2 approximation algorithm for the problem.
For a comparison, the following cluster S with 5 strings, each of length 7, shows that for different p, we indeed obtain different optimal centroids. For each p∈{1,2,∞}, string s∗p is an optimal p-norm centroid but it is not an optimal q-norm centroid, where q∈{1,2,∞}∖{p}. Moreover, one can verify that s∗2 is the only optimal 2-norm centroid and no optimal ∞-norm centroid (closest string) is an optimal 2-norm centroid.
====================================
The paper can be found on arXiv.